As developers look to build more AI-infused apps, many will turn to cloud-based speech-to-text services.
These speech-to-text services -- which are either part of the artificial intelligence portfolios that public cloud providers continue to build out or offered by third parties -- are still in their early days. However, they continue to gain capabilities, such as enhanced and automated punctuation, and will likely improve further as providers develop more accurate speech processing models.
For example, Amazon Transcribe, Microsoft Azure Speech to Text, Google Cloud Speech-to-Text, Speechmatics ASR and IBM Watson Speech to Text API enable developers to create dictation applications that can automatically generate transcriptions for audio files, as well as captions for video files. Call management platforms like Nexmo also provide access to transcription services that can be woven into more sophisticated call management workflows. Development teams can weave these capabilities into timesaving apps for a range of uses, including call center analytics, business transcription workflows, and video and web conference indexing. The biggest benefit of these speech recognition services, which are frequently delivered as APIs, is their ability to integrate with the broader platform of tools and services on which they run. They also have some important differences.
Follow this speech-to-text services comparison to analyze the offerings from AWS, Microsoft, Google, IBM, Speechmatics and Nexmo.
Azure Speech to Text
One of the strengths of Microsoft Azure Speech to Text is its support for custom speech and acoustic models, which enables developers to customize speech recognition software for a particular environment. A custom language model, for example, could improve transcription accuracy for a regional dialect, while a custom acoustic model could improve accuracy for a headset used in a call center. However, Microsoft charges an additional fee for the use of these custom models.
Developers can also code applications to deliver recognition results in real time; this could enable an application to give users feedback to speak more clearly or to pause when their words are not being properly recognized.
A recent innovation is the Microsoft Conversation Transcription service that can improve the transcription from live gatherings using three speakers on separate smartphones or laptops. Microsoft has also added support for a speaker verification service that confirms the identity of speakers based on their voice.
Amazon Transcribe
This AWS speech-to-text offering can automatically recognize multiple speakers and provide timestamps, which make it easier for users to locate the audio or video segment associated with a specific sentence. However, the service currently only supports English and Spanish.
Amazon has recently added support for diarization -- identifying the different speakers in an audio file and attributing the text to them in the transcription. It also now supports punctuation and formatting.
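Diarized output is usually easier to read once word-level speaker labels are merged into speaker turns. The following is a minimal sketch of that idea, not Amazon Transcribe's actual response format: the `Word` record, its field names and the `spk_0`-style labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word record; real services use their own schemas."""
    text: str
    start: float   # seconds from the start of the audio
    speaker: str   # assumed speaker label, e.g. "spk_0"


def group_into_turns(words: List[Word]) -> List[Tuple[str, float, str]]:
    """Merge consecutive words from the same speaker into
    (speaker, start_time, text) turns."""
    turns: List[Tuple[str, float, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            speaker, start, text = turns[-1]
            turns[-1] = (speaker, start, text + " " + w.text)
        else:
            turns.append((w.speaker, w.start, w.text))
    return turns


# Made-up sample data for a two-person call.
words = [
    Word("Hello,", 0.0, "spk_0"),
    Word("thanks", 0.4, "spk_0"),
    Word("for", 0.7, "spk_0"),
    Word("calling.", 0.9, "spk_0"),
    Word("Hi,", 1.8, "spk_1"),
    Word("I", 2.1, "spk_1"),
    Word("need", 2.2, "spk_1"),
    Word("help.", 2.5, "spk_1"),
]

for speaker, start, text in group_into_turns(words):
    print(f"[{start:5.1f}s] {speaker}: {text}")
```

Keeping the start time of each turn is what lets a user jump from a sentence in the transcript to the matching point in the recording.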
Google Cloud Speech-to-Text
Google has updated its speech-to-text engine to process both short audio snippets for voice interfaces and longer audio for transcription. The service can transcribe 120 languages in real time or from prerecorded audio files. It also includes a new proper noun processing engine that improves formatting for words that involve company or celebrity names.
The voice-to-text software supports several prebuilt transcription models for various use cases that improve accuracy for phone calls, video recordings or professionally recorded video. It supports audio formats such as FLAC, AMR, PCMU and WAV files. Also, SDKs are available for C#, Go, Java, Node.js, PHP, Python and Ruby. Google has also optimized the service to transcribe noisy audio without requiring additional noise cancellation.
Recent enhancements to this Google service include automatic punctuation and speaker diarization, which automatically guesses which speakers are talking on a shared channel of audio. It can also diarize audio that arrives on separate channels, such as a phone call, to improve speaker recognition.
The technical capabilities of these tools are critical, but any enterprise that's conducting a speech-to-text service comparison will obviously need to weigh those factors against the costs to run these services.
AWS, Microsoft and Google all provide a free tier to let developers test these speech-to-text services for a limited number of minutes or hours per month. From there, Azure Speech to Text costs $1 per audio hour for standard, $1.40 for custom speech and $2.10 for conversation transcription. Amazon Transcribe costs approximately $1.44 per hour. Google Cloud Speech-to-Text's standard model costs $0.006 per 15 seconds of audio for up to a million minutes, and $0.009 per 15 seconds for the video and enhanced phone call models -- there are discounts if you let Google log the data. IBM Watson text-to-speech is $0.02 per thousand characters, but custom models can be more expensive.
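Because providers quote prices in different units, it helps to normalize them before comparing. The sketch below uses only the hourly figures quoted above for Azure and Amazon, which will go out of date; the `free_hours` parameter and the helper names are assumptions for illustration, since free-tier sizes differ by provider.

```python
# Hourly list prices as quoted in this article; check current pricing pages.
HOURLY_RATE_USD = {
    "azure_standard": 1.00,
    "azure_custom_speech": 1.40,
    "azure_conversation_transcription": 2.10,
    "amazon_transcribe": 1.44,
}


def monthly_cost(service: str, audio_hours: float, free_hours: float = 0.0) -> float:
    """Estimate a monthly bill in USD after subtracting an assumed free tier."""
    billable = max(audio_hours - free_hours, 0.0)
    return round(billable * HOURLY_RATE_USD[service], 2)


def per_hour_from_increment(price_usd: float, increment_seconds: float) -> float:
    """Convert a price charged per billing increment into a per-hour price."""
    return round(price_usd * 3600 / increment_seconds, 2)


# Compare the quoted hourly rates for 100 hours of audio per month.
for name in HOURLY_RATE_USD:
    print(f"{name:35s} ${monthly_cost(name, 100):8.2f}")

# Illustrative only: a made-up price of $0.01 per 30-second increment.
print(per_hour_from_increment(0.01, 30))
```

A real estimate would also need to account for each provider's billing granularity and minimum charge per request, which this sketch ignores.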
IBM Watson Speech to Text API
IBM's transcription offering supports three different interfaces -- WebSocket, HTTP REST and asynchronous HTTP -- for submitting audio to be transcribed. Enterprises can also choose to customize interfaces for various purposes, such as phonetic translations. The API also includes ancillary services such as keyword spotting, a profanity filter and per-word confidence scores. It can also recommend alternate phrases when confidence is low.
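Per-word confidence scores are most useful when an application acts on them. The following sketch marks low-confidence words so a human reviewer can check them; the `(word, confidence)` pairs, the 0.6 threshold and the `[[...]]` marker are assumptions for illustration and do not reflect IBM's actual response format.

```python
from typing import List, Tuple


def flag_low_confidence(words: List[Tuple[str, float]], threshold: float = 0.6) -> str:
    """Return the transcript with words below the confidence threshold
    wrapped in [[...]] so a reviewer can spot them quickly."""
    marked = []
    for text, confidence in words:
        marked.append(f"[[{text}]]" if confidence < threshold else text)
    return " ".join(marked)


# Made-up recognition result: (word, confidence between 0 and 1).
sample = [
    ("please", 0.97),
    ("renew", 0.91),
    ("my", 0.95),
    ("prescription", 0.42),
]

print(flag_low_confidence(sample))
```

The threshold is a tuning choice: a lower value sends fewer words to review but lets more recognition errors through.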
IBM also provides a mobile SDK, which makes it easier to weave the service into mobile apps. IBM is one of the most expensive offerings, but it also simplifies integration with the company's other cognitive services.
Speechmatics ASR
Speechmatics is a U.K. company that focuses exclusively on optimizing its transcription engine for different enterprise use cases. The speech-to-text service can run in batch mode to transcribe prerecorded files, or in real time for low-latency use cases such as live-broadcast captioning. The service can be provisioned in multiple ways: as SaaS via the Speechmatics Cloud, on premises or on a VM in the public cloud.
Speechmatics supports 29 languages and provides advanced punctuation, custom dictionaries and the ability to detect speaker changes. It can also update real-time transcription to match context.
Nexmo
Nexmo is a cloud application development service built on top of the Vonage Internet Telephony platform. It enables developers to create custom applications that weave together call centers, messaging and authentication services. The service does not natively support transcription. However, it includes APIs -- SMS and voice -- that make it easy to send audio to AWS, Azure, Google and IBM transcription services.
The Nexmo service can also record up to 32 separate channels in a large audio recording, which could make it easier to attribute text to multiple speakers in a larger teleconference.
Quality still a factor in speech-to-text service comparisons
For the moment, these speech-to-text services are likely to complement -- rather than replace -- other input modalities. Still, they can provide value, especially by indexing large blocks of audio for compliance and customer service purposes or automatically generating captions for audio and video streams.
In cases where accuracy is paramount, developers should bake these tools into workflows that complement human transcribers. Developers can also use recording samples from existing sources to test the accuracy of these engines -- similar to an approach taken by Florida Institute of Technology researchers who developed a tool to analyze the quality of the different cloud speech engines. There's also a speech recognition benchmark on GitHub that includes support for the different cloud service APIs, while a separate benchmark tool from Adobe Research found Speechmatics had the highest accuracy.
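A common way to score such tests is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the engine's output into a human reference transcript, divided by the number of words in the reference. The sketch below is a minimal implementation under simple assumptions (lowercasing and whitespace tokenization only); it is not the method used by any of the benchmarks mentioned above, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts,
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Running the same reference recordings through each service and comparing the scores gives a like-for-like view of accuracy on an organization's own audio. In practice, punctuation and number formatting should be normalized first, since this sketch would count "mat." and "mat" as different words.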