OpenAI Reveals AI Tool To Recreate Human Voices (axios.com)

An anonymous reader quotes a report from Axios: OpenAI said on Friday it has allowed a small number of businesses to test a new tool that can recreate a person's voice from just a 15-second recording. The company said it is taking "a cautious and informed approach" to releasing the program, called Voice Engine, more broadly, given the high risk of abuse presented by synthetic voice generators.

Based on the 15-second recording, the program can create an "emotive and realistic" natural-sounding voice that closely resembles the original speaker. This synthetic voice can then be used to read text inputs, even if the text isn't in the original speaker's native language. In one example offered by the company, an English speaker's voice was translated into Spanish, Mandarin, German, French and Japanese while preserving the speaker's native accent.

OpenAI said Voice Engine has so far been used to provide reading assistance to non-readers, translate content, and help people who are non-verbal. It said the program has already been used in its text-to-speech application and its ChatGPT Voice and Read Aloud tool.
"We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities," the company said. "Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale."


  • by systemd-anonymousd ( 6652324 ) on Friday March 29, 2024 @03:06PM (#64354230)

    11labs (paid) and RVC (free, local) already do this, and the latter without censorship

    • 11labs (paid) and RVC (free, local) already do this, and the latter without censorship

11labs needs a minimum of 30 minutes of speech, while RVC needs several minutes of recorded speech. 15 seconds is a huge difference, and means they are not depending on the voice producing every type of sound/transition and then trying to recreate that.

      If they can realistically produce a similar sounding voice using data measured in seconds then it's certain that they are using a fundamentally different approach than existing systems.

      • >11labs needs a minimum of 30 minutes of speech while RVC needs several minutes of a recorded speech.

        That's wrong.

        "Short on time? No worries. Even brief audio snippets can be effective for generating a reliable voice clone." https://elevenlabs.io/voice-cl... [elevenlabs.io]

        I had good results with 11labs with less than a minute of audio.

        RVC needs a few minutes but there are forks that work with less than a minute, like Coqui, with 6 seconds: https://huggingface.co/coqui/X... [huggingface.co]

My favorite for cartoon character voices was 15ai (still down over a year (and a half?) later), since you could do text to speech in the voice of many cartoon characters with different tones/inflections for different emotions. *le sad sigh*
      • 15ai was killed by the sole dev's insanity and commitment to being as closed as possible, and funneling people through Patreon. He also was basically running RVC and Tortoise, but the real quality was just finely curated datasets. It became a grift, and as of 6 months ago he was getting $500/mo. from his Patrons while releasing just teases. You can search "site:reddit.com 15ai" and learn more.

The "as closed as possible" bit, ngl, always bothered me (since I believe tech like this should be open, more so knowing the foundation here is essentially built on OSS). I wonder, perhaps a stupid question, does this mean that if someone really, REALLY wanted to, they could (with relative ease) start up their own 15ai type deal, with most of the same things that made it attractive in the first place?
  • There I Ruined It has been using some AI tool to make song parodies to absolutely hilarious results. I think my latest favorite has to be the Bro Country Song. [youtube.com] I'm kind of disappointed someone hasn't made a Cybertruck commercial parody using it as the background music.

  • by MpVpRb ( 1423381 ) on Friday March 29, 2024 @03:26PM (#64354304)

    But WHY do the AI companies keep releasing stuff that allows evildoers to more easily do evil?
    We need tools for drug development, xray reading, automated software security scans, improved analysis of physics data and lots more beneficial stuff
    Meanwhile, the AI companies seem focused on making tools for scammers and worse

  • by zooblethorpe ( 686757 ) on Friday March 29, 2024 @03:45PM (#64354386)

    "... an English speaker's voice was translated into Spanish, Mandarin, German, French and Japanese while preserving the speaker's native accent."

    What?

    So the English speaker's original UK Received Pronunciation accent is preserved after translation into Japanese? How the flippety fuck does that work?

    I even went the extra step and read the linked article [axios.com] itself. Same text, for this piece, with no additional information.

    I rather suspect that they are misusing the word "accent" here.

    (Written from the perspective of a professional translator and occasional interpreter for English and Japanese.)

I suppose it just means they would give the English speaker a generic English accent in the generated Japanese. Maybe you can even dial in how much of an accent your pretend-Japanese-speaking self should have.
    • Listen to the samples in the linked OpenAI blog post. It does indeed preserve the native accent (which makes it sound like an American trying to speak a foreign language).

      • Oofda. Preserving an American accent when rendering in other languages seems ... an unfortunate choice.

        I also noticed a goof in the Japanese audio, where the final clause became "yuujou no kizu o iwaimashou" — "let's celebrate the wounds / scars of friendship", instead of kizuna, "connections / bonds / ties". The text spells it out correctly, but the audio is missing that na on the end of kizuna. That's not an issue of accent, that's just a plain old vocabulary goof. Interesting.

        • I think there is some tweaking they could do one way or the other. In the French pronunciation the American accent only very rarely comes through with most Rs being rolled properly (quite pronounced even for a native speaker, I'd say). The German version has far more of the accent.

          I would imagine that the training (result) can be skewed towards the correct pronunciation if desired (or even stronger towards the original speaker's accent, although it is pretty hard to imagine many valid use cases for that).

Obviously, they will just be concerned that some of the crap they unleash upon the world could splatter back onto them. Apart from that, greed is good! They will obviously monetize this to the maximum degree possible, like they have done with all their products so far.

  • Now we can have more videos like this: https://m.youtube.com/watch?v=... [youtube.com]
