When the coronavirus hit, Intelligent Voice, a vendor of high-performance speech recognition services in Great Britain, pivoted quickly. Within months, the vendor launched a new product, Myna, which connects to virtual meeting tools, enabling users to automatically record and transcribe their meetings.
To achieve such a fast turnaround on Myna, Intelligent Voice used Nvidia Jarvis, a framework for building conversational AI models on GPUs that Nvidia first unveiled in May.
While Nvidia put Jarvis into open beta during its GTC 2020 virtual conference Oct. 5, Intelligent Voice, an early adopter of Jarvis, has been using it for months.
Intelligent Voice is no stranger to Nvidia products. The speech recognition vendor has worked with Nvidia, a major vendor of GPU chips and AI software, for seven years, and helped pioneer the use of GPU cards for speech recognition in 2014, said Nigel Cannings, CTO of Intelligent Voice.
Nvidia and Intelligent Voice have worked on several engineering projects for speech recognition since then, Cannings said, so it made sense for the company to pilot Jarvis in early access.
Jarvis is essentially an extension of Nvidia Triton Inference Server, open source inference-serving software designed to simplify the deployment of AI models at scale, and Nvidia NeMo, a Python toolkit for building and training GPU-accelerated conversational AI models.
While NeMo provided the building blocks for training conversational AI models, and Triton served those models for inference, there was no way to bring it all together into an end-to-end system, Cannings said. Jarvis creates that complete system.
But, for Intelligent Voice, Jarvis wasn't an immediate success. Initially, the framework wasn't particularly fast or optimized, and it faced some "serious bottlenecks" in GPU inferencing performance when working with large amounts of audio, according to Cannings.
For a vendor whose customers process millions of hours of audio, that was a problem.
"One of the things that we've always sold on is the fact that we can take a GPU card and ... push through 1,000 hours of audio in an hour," Cannings said. "We were seeing speeds well below that with Jarvis."
The optimizations, resulting in the current open beta version of Jarvis, have enabled Intelligent Voice to hit its required speeds, Cannings said.
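To put that throughput target in perspective, a quick back-of-the-envelope calculation (the numbers below are illustrative of the stated target, not a vendor benchmark):

```python
# Back-of-the-envelope check of the "1,000 hours of audio in an hour"
# target Cannings describes. Figures are illustrative, not measured.

AUDIO_HOURS = 1_000     # hours of audio processed per wall-clock hour
WALL_SECONDS = 3_600    # seconds in one wall-clock hour

# Real-time factor: how many seconds of audio are processed per
# second of wall-clock time. 1,000 hours/hour -> 1,000x real time.
real_time_factor = (AUDIO_HOURS * 3_600) / WALL_SECONDS

# Equivalently, each hour of audio must clear the GPU in 3.6 seconds.
seconds_per_audio_hour = WALL_SECONDS / AUDIO_HOURS

print(real_time_factor)        # 1000.0
print(seconds_per_audio_hour)  # 3.6
```

At that rate, falling "well below" the target, as the early Jarvis builds did, directly multiplies the hardware a customer needs for the same backlog of audio.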
Jarvis can take historically independent models, such as those for speech recognition or natural language processing, and fuse them, rather than treating each as a separate model with its own inputs and outputs.
Jarvis is "capable of looking at all of the different layers within those models and fusing them, and actually treating it as a single end-to-end entity, even though I've trained it as four separate models," Cannings said.
That, along with the optimizations Nvidia made, has enabled Intelligent Voice to see extremely low latency numbers with Jarvis.
"I've not seen anything perform as well as this, end-to-end," Cannings said.
It's significantly easier to train models with less data using Jarvis than with earlier frameworks, and much quicker to train and deploy models in general, he said.
"We've gone from the point where perhaps something which might have taken tens of man-years of engineering has been reduced to a much smaller time to market," Cannings added.
Still, he added, Jarvis isn't something users can download and instantly begin creating models on. It's a complex system and requires users to have a strong understanding of their business and technical problems.
"If you understand the use case, then it becomes pretty easy to adapt to your domain and get a product to market quickly," Cannings said.