OpenAI Reveals AI Tool To Recreate Human Voices (axios.com) 24
An anonymous reader quotes a report from Axios: OpenAI said on Friday it's allowed a small number of businesses to test a new tool that can recreate a person's voice from just a 15-second recording. The company said it is taking "a cautious and informed approach" to releasing the program, called Voice Engine, more broadly given the high risk of abuse presented by synthetic voice generators.
Based on the 15-second recording, the program can create an "emotive and realistic" natural-sounding voice that closely resembles the original speaker. This synthetic voice can then be used to read text inputs, even if the text isn't in the original speaker's native language. In one example offered by the company, an English speaker's voice was translated into Spanish, Mandarin, German, French and Japanese while preserving the speaker's native accent.
OpenAI said Voice Engine has so far been used to provide reading assistance to non-readers, translate content and to help people who are non-verbal. It said the program has already been used in its text-to-speech application and its ChatGPT Voice and Read Aloud tool. "We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities," the company said. "Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale."
Late to the party (Score:4, Insightful)
11labs (paid) and RVC (free, local) already do this, and the latter without censorship
... and down to clown. (Score:3)
11labs (paid) and RVC (free, local) already do this, and the latter without censorship
11labs needs a minimum of 30 minutes of speech, while RVC needs several minutes of recorded speech. 15 seconds is a huge difference; it means they are not depending on the voice producing every type of sound/transition and then trying to recreate each one.
If they can realistically produce a similar sounding voice using data measured in seconds then it's certain that they are using a fundamentally different approach than existing systems.
Re: (Score:2)
>11labs needs a minimum of 30 minutes of speech while RVC needs several minutes of a recorded speech.
That's wrong.
"Short on time? No worries. Even brief audio snippets can be effective for generating a reliable voice clone." https://elevenlabs.io/voice-cl... [elevenlabs.io]
I had good results with 11labs with less than a minute of audio.
RVC needs a few minutes, but there are forks that work with less than a minute, like Coqui, which needs 6 seconds: https://huggingface.co/coqui/X... [huggingface.co]
Re: (Score:2)
Re: (Score:2)
15ai was killed by the sole dev's insanity and commitment to being as closed as possible, and funneling people through Patreon. He also was basically running RVC and Tortoise, but the real quality was just finely curated datasets. It became a grift, and as of 6 months ago he was getting $500/mo. from his Patrons while releasing just teases. You can search "site:reddit.com 15ai" and learn more.
Re: (Score:2)
Re: (Score:2)
Certainly they could, and people have done that
Re: (Score:2)
YouTubers already using something similar (Score:2)
There I Ruined It has been using some AI tool to make song parodies to absolutely hilarious results. I think my latest favorite has to be the Bro Country Song. [youtube.com] I'm kind of disappointed someone hasn't made a Cybertruck commercial parody using it as the background music.
In general, I'm a fan of AI (Score:4, Insightful)
But WHY do the AI companies keep releasing stuff that allows evildoers to more easily do evil?
We need tools for drug development, X-ray reading, automated software security scans, improved analysis of physics data, and lots more beneficial stuff.
Meanwhile, the AI companies seem focused on making tools for scammers and worse.
Re: (Score:2)
In other news, knife companies keep making knives (Score:3)
Re: (Score:2, Informative)
You are under a misapprehension here: You seem to think AI companies are run by good people who would have honor, integrity and a desire to positively contribute to the human endeavor. That is not the case.
That makes no sense at all. (Score:2)
"Native accent"?? (Score:3)
What?
So the English speaker's original UK Received Pronunciation accent is preserved after translation into Japanese? How the flippety fuck does that work?
I even went the extra step and read the linked article [axios.com] itself. Same text, for this piece, with no additional information.
I rather suspect that they are misusing the word "accent" here.
(Written from the perspective of a professional translator and occasional interpreter for English and Japanese.)
Re: (Score:2)
Re: (Score:2)
Listen to the samples in the linked OpenAI blog post. It does indeed preserve the native accent (which makes it sound like an American trying to speak a foreign language).
Re: (Score:2)
Oofda. Preserving an American accent when rendering in other languages seems ... an unfortunate choice.
I also noticed a goof in the Japanese audio, where the final clause became "yuujou no kizu o iwaimashou" — "let's celebrate the wounds / scars of friendship", instead of kizuna, "connections / bonds / ties". The text spells it out correctly, but the audio is missing that na on the end of kizuna. That's not an issue of accent, that's just a plain old vocabulary goof. Interesting.
Re: (Score:2)
I think there is some tweaking they could do one way or the other. In the French pronunciation the American accent only very rarely comes through, with most Rs being rolled properly (quite pronounced even for a native speaker, I'd say). The German version has far more of the accent.
I would imagine that the training (result) can be skewed towards the correct pronunciation if desired (or even further towards the original speaker's accent, although it is pretty hard to imagine many valid use cases for that).
"Caution"? Nice joke! We all laughed... (Score:2)
Obviously, they will just be concerned that some of the crap they unleash upon the world could splatter back onto them. Apart from that, greed is good! They will obviously monetize this to the maximum degree possible, like they have done with all their products so far.
Sweet! (Score:2)