OpenAI's new speech-to-speech model, designed for more natural speech and reasoning, shows the continued evolution of speech-to-speech technology and how AI voices are becoming harder to distinguish from human ones.
On Aug. 28, the AI vendor introduced gpt-realtime and new API capabilities, including Model Context Protocol (MCP) server support, image input and phone calling through the Session Initiation Protocol (SIP). SIP is a protocol for initiating, managing and terminating multimedia communication sessions like voice and video calling, instant messaging and gaming over IP networks.
OpenAI said the new speech-to-speech gpt-realtime is good at interpreting system messages and developer prompts. This means the model can read disclaimer scripts word-for-word on a support call, switch between languages midsentence or repeat alphanumeric passages back to the user. OpenAI also released two new voices, Cedar and Marin, which are available in Realtime API.
Image input in gpt-realtime lets users add images, photos and screenshots to a Realtime API session alongside audio or text. OpenAI introduced Realtime API last October and it is now generally available, along with the new speech model.
Some benefits
The model is best suited to applications that depend on natural-sounding voice agents.
"Gpt-realtime unifies speech recognition, reasoning and speech generation in a single model, eliminating the latency of multimodel pipelines," said Arun Chandrasekaran, an analyst at Gartner. "This makes it suited for real-time, voice-first applications where fluidity and speed are critical."
He added that customer support and contact centers will benefit from the expressive multilingual voices. Moreover, the education and healthcare sectors can use them for tutoring or patient engagement.
Chandrasekaran said the new voices also bring more humanlike expressiveness.
"It follows instructions faithfully, with the promise of smoother emotional inflection," he said.
The new model is a nice evolution from a UX perspective, said David Nicholson, an analyst at The Futurum Group.
"Some new voices sound more natural, [which] will delight some and freak others out," he said. "It is still not the most natural, but the most streamlined 'back end' now."
He added that developers previously needed separate models for automatic speech recognition, language understanding and text-to-speech.
"The unified speech-to-speech pipeline simplifies integration," Nicholson said. "That's important for developers who will like the simplified workflow."
Some challenges
The new model, however, comes with some challenges.
For one, Nicholson said that his tests on 5G and home Wi-Fi show that the model is "still not perfectly real time."
He added that the delay will shrink over time and that, for now, it might even ease the eeriness of how real AI speech is getting.
"Right now, we at least have indications that we are talking to AI at times," he said. "Once latency is reduced enough, things get scary."
Many consumers already have a tough time distinguishing between what is AI and what is not.
"Regulatory scrutiny around voice impersonation is a major looming challenge," Chandrasekaran said.
According to OpenAI, Realtime API has safeguards that help prevent misuse. Developers can also add their own safety guardrails with the Agents SDK.
Chandrasekaran added that another challenge with the speech-to-speech model is the 32k context window. He said that compared with rivals, it is small and limits long-form applications or applications that rely heavily on memory.
"The 32k limit supports extended conversations and multimodal tasks but restricts very long dialogues or enterprise document processing," he said.
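To illustrate what a fixed context window means in practice, here is a minimal Python sketch of one common workaround: dropping the oldest conversation turns once a token budget is exceeded. It is not how Realtime API itself manages context. The exact limit of 32,000 tokens, the four-characters-per-token estimate and the function names are all assumptions made for illustration.

```python
from collections import deque

# "32k" as reported in the article; the exact figure is an assumption.
CONTEXT_LIMIT_TOKENS = 32_000


def rough_token_count(text: str) -> int:
    """Hypothetical estimator: roughly 4 characters per token.

    This is a rule of thumb, not a real tokenizer.
    """
    return max(1, len(text) // 4)


def trim_history(turns: list[str], limit: int = CONTEXT_LIMIT_TOKENS) -> list[str]:
    """Keep the most recent turns whose combined estimate fits within limit."""
    kept: deque[str] = deque()
    used = 0
    # Walk backward from the newest turn and stop at the first one that
    # would push the running total over the budget.
    for turn in reversed(turns):
        cost = rough_token_count(turn)
        if used + cost > limit:
            break
        kept.appendleft(turn)
        used += cost
    return list(kept)
```

In a long support call, a caller's earliest turns would be the first to fall out of the window under this scheme, which is the kind of memory limitation Chandrasekaran describes.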
The gpt-realtime model costs $32 per 1 million input tokens and $64 per 1 million output tokens. OpenAI also revealed that MCP support is now available in Realtime API.
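For a sense of scale, the quoted per-token prices can be turned into a rough session cost. The short Python sketch below uses only the two prices reported here; the function name and the sample token counts are hypothetical, and any other pricing tiers are ignored.

```python
# Prices quoted in the article, in US dollars per 1 million tokens.
INPUT_USD_PER_MILLION = 32.0
OUTPUT_USD_PER_MILLION = 64.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the charge for a session from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical session: 50,000 input tokens and 20,000 output tokens.
print(f"${estimate_cost_usd(50_000, 20_000):.2f}")  # prints $2.88
```

Because output is priced at twice the rate of input, the model's share of the conversation weighs more heavily on the bill than the user's.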
Esther Shittu is an Informa TechTarget news writer and podcast host covering artificial intelligence software and systems.