As someone who has produced multimedia content for nearly 30 years, Josh Cavalier is no stranger to AI technologies and, recently, generative AI tools.
Earlier this month, Cavalier, founder of a website that offers tutorials on creating educational videos and on instructional video strategies, made a video about the popular new generative AI system ChatGPT, using ChatGPT and AI speech software from ElevenLabs.
The video shows a back-and-forth interaction between Cavalier and a ChatGPT robot speaking with a human-like voice.
To create the voice, Cavalier loaded the robot's script into ElevenLabs, then generated and downloaded the audio.
"ElevenLabs impressed me in regards to the quality," Cavalier said, adding that he tried other AI text-to-speech voice cloning platforms, including Wellsaid Labs and Murf AI. "For me, a lot of it is in the selection, in motion and pausing. I call it intentional pausing, which, depending upon the cadence in the emotion, there's either going to be less pausing that's happening or more pausing that's going on. And that platform seems to get it, which is pretty impressive."
After posting the initial explainer video, Cavalier also posted other videos on his YouTube channel using the same ChatGPT robot.
For a startup that only launched its VoiceLabs platform at the beginning of this year, ElevenLabs has already gained considerable traction among consumers, industry analysts and organizations.
TechTarget Editorial has used the platform to voice a news story.
The voice cloning research lab was founded in 2022 by former Google engineer Piotr Dabkowski and ex-Palantir strategist Mati Staniszewski, both natives of Poland.
The idea for the research lab -- which has now become a vendor -- emerged after both men noticed that in Poland, when consumers watch foreign content, they usually rely on one monotone narrator speaking in Polish.
"We understood and started realizing how big of a problem that is -- that you usually cannot make content multilingual because it's just too expensive," Staniszewski said.
This problem led to the birth of ElevenLabs and its mission of developing technology that can make content available in any language or voice.
The startup spent its first year researching and refining its text-to-speech, voice cloning and voice creation technology. For Cavalier, the startup turned out to be a startlingly effective tool.
"The one thing that separates ElevenLabs is how natural the character sounds based upon your input," Cavalier said.
Cavalier, who used the platform to train a clone of his voice, noted that after trying 18 or 19 samples from the 25 prompts provided by ElevenLabs, the platform was getting closer and closer to replicating his voice.
"It's pretty scary but pretty cool at the same time," he said. "If I need to do a voiceover and I just have a script and I want to knock it out really fast, I can use the platform to do it."
AI personified and a blurry line
When used with large language models such as ChatGPT, voice cloning platforms such as ElevenLabs represent "the beginning of personifying AI," said Mike Gualtieri, an analyst at Forrester Research.
"ElevenLabs is going to see faster success because of large language models like ChatGPT," Gualtieri said. "There's a lot of use cases for ElevenLabs for just regular content marketing standpoint. But when you combine it with ChatGPT, I think it doubles the whole number of use cases for it."
Beyond marketing, the technology can also be used with ChatGPT to generate videos, such as educational content or personalized messages from key players in enterprises.
A key application could be marketing programs that combine ChatGPT and ElevenLabs to generate personalized marketing messages for consumers. For example, a consumer who is looking to buy an item like a refrigerator could receive a personalized message from a refrigerator manufacturer about why they should choose their product.
However, as messages become personalized and voice cloning technology improves quickly, the lines between real and fake can become blurry.
Less than a month after VoiceLabs was released, reports emerged that some users had used ElevenLabs' technology to generate voice clips that sounded like celebrities, such as actor Emma Watson; Democratic congresswoman Alexandria Ocasio-Cortez; and Republican Ben Shapiro, political podcaster and radio host.
Numerous posts on internet forum 4chan featured deepfake videos of the celebrities making racist, sexist and homophobic statements they never uttered.
In response to bad actors misusing the technology, ElevenLabs said on Jan. 31 that it would add safeguards to trace misuse of its technology. Among those safeguards, the vendor said, is the introduction of paid tiers that require verification. The free trial version has a character limit.
Trying to keep bad actors from misusing ElevenLabs' technology is challenging, Staniszewski said.
"It's problematic because the traditional approach in the space is relatively hard," he said.
The traditional approach -- checking specific words to verify a voice -- can also be tampered with, especially if the words are computer-generated, Staniszewski noted.
"We believe that over time, this could get even harder," he said. "We can help on a contractual basis. But then, hopefully, the wider field will adopt what's the right approach, and how to verify voices will catch up."
Currently, ElevenLabs requires written permission from the voice owner when customers use the technology for professional voice cloning, he said.
Enterprises may have to be the ones to decide how they will present to consumers whether what they're hearing or seeing is real or fake, according to Gualtieri.
Some companies may decide to mark what is AI-generated up front, while others may choose not to do that.
"In some ways, we're judging AI at a higher standard that we judge people," Gualtieri said. "We expect that there are bad people that do bad things. And there're good people trying to prevent those things. This is software, and software can be used by bad actors."
The difficulty in differentiating AI-generated voices from real ones may also be a problem for creators who can no longer distinguish between their own voice and one generated by an AI tool, said Chirag Shah, a professor at the Information School at the University of Washington.
"This is a concern for real [voice] artists, too, because now they can see themselves being easily replaced by these kinds of voice synthesizers … or text synthesizers," Shah said.
This could become an even bigger problem if the technology trains itself without the permission of voice artists and doesn't give them credit.
It's unclear if this is the case with ElevenLabs. Staniszewski said ownership of each voice the system trains on belongs to the person using the platform.
With only $2 million in funding, ElevenLabs still has a lot of growing to do. However, the best way to stand out among the fast-growing ranks of generative AI vendors is for it to continue to put its models out there, Gualtieri said.
"Creating these models, that's very academic," he said. "What's going to help them is getting these models out to the public and facing the criticisms and doing the tests."
This will help vendors of voice-generating AI technology to improve the models and eliminate some of the weaknesses that could help others misuse the technology, he added.
"They can innovate and uncover very quickly all of the ups and downs and then try to protect their reputation and protect their customers' reputation," Gualtieri said.