It will be hard to tell whether the next Darth Vader voice you hear belongs to James Earl Jones or to an AI clone of the legendary actor's voice.
Jones' plan to step back from playing the famous Star Wars movie role while allowing his voice to be re-created with AI technology shows the dramatic growth of voice AI technology in recent years. However, it also highlights some ethical concerns.
"James Earl Jones is the masterpiece when it comes to deeply modulated voice," said Andy Thurai, analyst at Constellation Research. "His graciously donating his voice to create voice cloning for conversational AI would be great. I see this growing in many areas."
The terms under which Jones signed over the rights to his voice to AI vendor Respeecher are not publicly known.
Enterprise use of the technology is still sparse, but as more organizations employ conversational AI tools, the global voice cloning market could surpass $5 billion by the end of the decade, according to some market research findings.
Voice cloning comes in two main forms: text-to-speech, in which AI renders written text in a chosen voice, and speech-to-speech, in which AI converts one person's speech into another person's voice. Speech-to-speech is the type of voice cloning that will clone Jones' voice.
Respeecher, based in Ukraine, helps content creators clone voices using machine learning and AI.
"Speech-to-speech is more for controllability and fine emotion," said Respeecher's Dmytro Bielievtsov, who co-founded the company in 2018.
The vendor's technology uses a conversion system whose algorithms are exposed to speech from many different speakers. The system learns how speech works at the phonetic level -- the range of sounds humans can make -- and then imitates it.
Beyond imitation, the system also interprets the inflections, intonations and accents of the input speech and carries them into the new speech, drawing on what it previously learned.
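Respeecher's actual models are proprietary neural networks, but the core idea of speech-to-speech conversion -- learn a mapping between two speakers' acoustic features, then apply it to new input while preserving the input's frame-by-frame expression -- can be illustrated with a deliberately simplified sketch. Everything below (the linear model, the invented pitch values) is an assumption for illustration only, not how any vendor's system works.

```python
import numpy as np

# Toy stand-ins for per-frame acoustic features (e.g., pitch in Hz)
# of the same material spoken by two different speakers.
rng = np.random.default_rng(0)
source_frames = rng.uniform(100, 150, size=(200, 1))   # source speaker
target_frames = source_frames * 1.4 + 20               # target speaker (hidden relation)

# "Training": fit a linear map from source to target features, standing in
# for the neural conversion model a real system would learn from paired data.
X = np.hstack([source_frames, np.ones((len(source_frames), 1))])
coef, *_ = np.linalg.lstsq(X, target_frames, rcond=None)

def convert(frames: np.ndarray) -> np.ndarray:
    """Map a new utterance's source-speaker frames into the target voice."""
    X_new = np.hstack([frames, np.ones((len(frames), 1))])
    return X_new @ coef

new_utterance = rng.uniform(100, 150, size=(50, 1))
converted = convert(new_utterance)
# The converted contour follows the target speaker's characteristics while
# keeping the shape of the input contour -- the "fine emotion" Bielievtsov
# describes -- because the mapping is applied frame by frame.
```

A real system replaces the linear map with a deep network trained on hours of speech and operates on rich spectral features rather than a single pitch value, but the controllability argument is the same: the performance comes from the input speaker, only the voice identity changes.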
The technology is common in Hollywood, especially for scenes where stand-in actors are required, Bielievtsov said. "This simplifies things, and any actor who's good could become a voice stand-in," he said.
Previously, Respeecher has used the technology for several projects, including Mark Hamill's voice in The Mandalorian, in which the technology was used to de-age the actor's voice. The vendor also worked on a Super Bowl project about the late football coaching great Vince Lombardi.
Despite its famous clientele, Respeecher has had to work to refine its technology so that the voice the AI produces sounds realistic rather than mechanical.
"Getting really high-quality speech is challenging," Bielievtsov said. "Making just the sound quality high enough and not to have too many artifacts, that is a challenge that took us a lot of time to nail down."
Another challenge the vendor faces is data efficiency -- how much recorded speech the AI model needs for training, Bielievtsov continued. While the vendor would like to train the model on as little as five seconds of someone's speech, it currently needs about five minutes. The vendor is working on optimizing the training process so that the model requires less data.
In text-to-speech voice cloning, the AI model translates written text into speech.
Text-to-speech enables users to create a variety of tones, accents and languages. It is also known as synthetic speech.
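Before any audio is generated, a text-to-speech system first converts written words into a timed sequence of phonemes. The sketch below illustrates that front-end stage with a tiny hand-made dictionary; the word list, phoneme symbols and durations are all invented for illustration, and real synthesizers use learned grapheme-to-phoneme models and neural vocoders instead.

```python
# Toy text-to-speech front end: text -> phonemes -> timed frames.
# Dictionary and durations are invented; this is a conceptual sketch only.
PHONEMES = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
DURATION_MS = {"HH": 60, "AH": 90, "L": 70, "OW": 120, "W": 70, "ER": 110, "D": 50}

def text_to_phonemes(text: str) -> list[str]:
    """Look up each word's phoneme sequence (toy grapheme-to-phoneme step)."""
    return [p for word in text.lower().split() for p in PHONEMES[word]]

def phoneme_timeline(phonemes: list[str]) -> list[tuple[str, int]]:
    """Attach a duration to each phoneme, as a prosody model would."""
    return [(p, DURATION_MS[p]) for p in phonemes]

timeline = phoneme_timeline(text_to_phonemes("Hello world"))
total_ms = sum(duration for _, duration in timeline)  # utterance length
```

In a production system, this timeline would then drive an acoustic model that renders each phoneme in the cloned voice -- which is where the tone, accent and language variety mentioned above come in.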
Many of the big technology vendors have text-to-speech offerings. For example, the Google Cloud text-to-speech API powers Custom Voice, a capability that lets developers train a custom voice model using audio recordings. Microsoft Azure also has a capability that lets users build a voice for text-to-speech apps. And Nvidia has showcased speech synthesis tools that developers can use with avatars for the vendor's Omniverse platform and conversational AI offerings.
AI avatars and digital humans in the metaverse have driven interest in text-to-speech voice cloning, according to Gartner analyst Annette Jump.
"[There is] interest in AI avatars as part of conversational AI -- kind of humanizing virtual assistants, or using avatars or digital humans in the metaverse where you create a digital version of yourself or a digital representative for your company," Jump said.
Cloned speech can be useful in call centers, helping agents to speak in a different language, and in settings where a digital avatar is needed.
For example, for enterprises that use digital avatars to support call center agents, synthetic speech can communicate with consumers before they're transferred to an actual agent. Admissions departments can also use synthetic speech with digital avatars to attract new students. And consumers can converse with an avatar as if they are talking to a real person.
"The quality of synthetic voice has significantly improved today versus five or seven years ago," Jump added. "But there are certainly ways of improving it further in terms of more different languages or dialects."
Speech-to-speech can also be used when people might have trouble understanding call center agents because of varied accents. "They might want to adjust that in real time without taking away too much of their emotion and whatever they want to express," Bielievtsov said.
Other enterprise applications could include the technology used in an advisory capacity, according to Thurai. "If I had a question about anything, I could talk to someone to get an answer live in their own voice, but not with a real person," he said.
While Respeecher is mainly aiming its speech-to-speech technology at Hollywood filmmakers and directors, the company wants to branch into healthcare.
For example, patients who have had their vocal cords removed could have them replaced with a device with embedded speech-to-speech AI technology.
"A real-time voice conversion device for medical purposes would make their lives much better in the sense that the voice would sound more natural," Bielievtsov said.
While the possibilities for voice cloning technologies -- both synthetic voice and speech-to-speech -- continue to grow, other applications illustrate the ethical questions the technology raises.
Most recently, Podcast.ai released a podcast of an interview between Steve Jobs and Joe Rogan. The podcast sounds like it includes the real voices of both individuals, but it was generated entirely by AI.
"As technology progresses, certainly, it's quite easy that you need only a very limited amount of someone's words to be able to clone it," Jump said.
This brings the question of privacy to the forefront because the technology could make it easy for bad actors to misuse people's voices.
"You are just one hacker away from someone cloning your voice to hack into your systems that are voice-activated," Thurai said.
Voice cloning can also have a substantial influence on political campaigns.
"A lot of fake videos and audio can start to circulate during the political campaign season, trying to accuse the opponents of something," Thurai added. "It is going to become extremely difficult to prove the origin of the video or audio."
Another concern is finding a way to use the technology to support humans instead of replacing them, as in the James Earl Jones case, according to Yugal Joshi, analyst at Everest Group.
"Eventually, they will replace people through these systems, and they have to strategize and plan for that day," he said.
"The key challenge of these systems is they normally fail in the real world," Joshi added. "When a dedicated use case like replicating James' voice or Steve Jobs' is there, you also have the chance to edit out and improve. In a real-life scenario, if damage is done, it is done. There are no retakes."