AI Microsoft

Microsoft's New AI Can Simulate Anyone's Voice With 3 Seconds of Audio (arstechnica.com) 71

An anonymous reader quotes a report from Ars Technica: On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything -- and do it in a way that attempts to preserve the speaker's emotional tone. Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a "neural codec language model," and it builds off of a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components (called "tokens") thanks to EnCodec, and uses training data to match what it "knows" about how that voice would sound if it spoke other phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper (PDF): "To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."

[...] While using VALL-E to generate those results, the researchers only fed the three-second "Speaker Prompt" sample and a text string (what they wanted the voice to say) into VALL-E. So compare the "Ground Truth" sample to the "VALL-E" sample. In some cases, the two samples are very close. Some VALL-E results seem computer-generated, but others could potentially be mistaken for a human's speech, which is the goal of the model. In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the "acoustic environment" of the sample audio. For example, if the sample came from a telephone call, the audio output will simulate the acoustic and frequency properties of a telephone call in its synthesized output (that's a fancy way of saying it will sound like a telephone call, too). And Microsoft's samples (in the "Synthesis of Diversity" section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process.
Microsoft has not provided VALL-E code for others to experiment with, likely to avoid fueling misinformation and deception.

In conclusion, the researchers write: "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."
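
The pipeline the paper describes (prompt audio to acoustic tokens, text to phonemes, conditioned token generation, codec decoding) can be sketched as a toy data flow. This is a minimal sketch under heavy assumptions, not VALL-E or EnCodec: Microsoft has released no code, and every function name, the codebook size, and the fixed frames-per-phoneme duration below are hypothetical stand-ins. A seeded random sampler takes the place of the trained language model, which also illustrates how changing the seed can yield a different rendition of the same text.

```python
import random

CODEBOOK_SIZE = 1024      # assumed size of the codec's token vocabulary
FRAMES_PER_PHONEME = 3    # assumed fixed duration; a real model would learn this


def encode_audio(samples, frame=4):
    """Toy codec encoder: map each frame of samples in [-1, 1] to one token."""
    tokens = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        mean = sum(chunk) / len(chunk)
        tokens.append(int((mean + 1.0) / 2.0 * (CODEBOOK_SIZE - 1)))
    return tokens


def text_to_phonemes(text):
    """Toy phonemizer: treat each letter as one 'phoneme'."""
    return [c for c in text.lower() if c.isalpha()]


def generate_tokens(phonemes, prompt_tokens, seed):
    """Toy stand-in for the language model.

    Samples acoustic tokens conditioned on the phonemes (content) and on the
    prompt tokens (speaker), using a seeded random generator.
    """
    rng = random.Random(seed)
    center = sum(prompt_tokens) // len(prompt_tokens)  # crude "speaker" summary
    out = []
    for p in phonemes:
        for _ in range(FRAMES_PER_PHONEME):
            offset = (ord(p) - ord("a")) + rng.randint(-8, 8)
            out.append(max(0, min(CODEBOOK_SIZE - 1, center + offset)))
    return out


def decode_tokens(tokens):
    """Toy codec decoder: map tokens back to samples in [-1, 1]."""
    return [t / (CODEBOOK_SIZE - 1) * 2.0 - 1.0 for t in tokens]


def synthesize(text, prompt_samples, seed=0):
    """Encode the prompt, phonemize the text, generate tokens, decode them."""
    prompt_tokens = encode_audio(prompt_samples)
    phonemes = text_to_phonemes(text)
    return decode_tokens(generate_tokens(phonemes, prompt_tokens, seed))
```

Calling synthesize("hello", prompt, seed=1) twice with the same non-empty prompt returns identical samples, while a different seed will generally give a different sequence for the same text and prompt.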

  • by Anonymous Coward

    ...the new Bruce Willis movies where he actually talks.

    • by Z80a ( 971949 )

      In every language sounding the same

  • "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."

    Selling the vaccine along with the disease? How very quaint of Greed in our modern pandemic-riddled times.

    I wonder if it will be "possible" to hire this 'detection model' for less than $10K/hour when you're falsely accused of a crime and need the fucking thing to prove your innocence...not that obvious risks aren't obvious or anything...

    • by DarkOx ( 621550 )

      I can't think of a better way to silence any independent voice and citizen journalism than to create tools to flood the world with fake images, audio clips, and documents that, as far as the average person can tell, are indistinguishable from real recordings and individuals' writing styles.

      Next you make sure tools to actually authenticate media as real vs generated are complex and expensive so that only a handful of big-tech and media gatekeepers can reliably do so - if they choose to do so - and when they chos

      • While I agree about the useful idiots defending big tech, I am more optimistic about flooding the world with fake everything -- simply because when everyone has the same tools, the people combined are a greater force than any organization. This is why media propaganda doesn't work today: people simply choose to hear what they prefer to hear anyway. (I used to say, people don't hate Trump because they watch CNN -- they watch CNN because they hate Trump. But Trump is now gone, and CNN is moving towards the mid

        • by DarkOx ( 621550 )

          I appreciate your optimism, it's kind of refreshing. I just don't see our system working in that environment. Sure, all the information sources can cancel each other out, but the impact of that will be that you can't find verifiable, reliable information on anything you can't directly observe.

          Which means we really won't be able to have a national conversation about anything. Truth is maybe we should not, but then we probably also need to do a serious rethink about national vs state and local power. I'd really

    • There's a wonderful Soviet book of science fiction called "Candidate of Death" where this scenario plays out. It's set in near-future USA where justice was "objectivized" by essentially using a computer that accepts converted (to punch cards and tapes ofc.) evidence and renders a verdict, since it's unbribable. The private detective in question is hired by a victim of such justice, claiming he was falsely accused of murder, and indeed in the end, he finds a chain of evidence-forgers linked to the mob. It's
  • Try to emulate this [youtube.com]!
    • Brilliant!
    • by Rei ( 128717 )

      One thing that I found when messing with TortoiseTTS to mimic voices (example [youtube.com]) was that the more "conventional" the speaker sounded (native English speaker, no accents, etc), the better a job it did, but the more of an accent or other weirdness, the more difficulty it had. For example, I tried to set up a mock phone call between Zelenskyy and Putin, and it was a total fail.

  • Cue the Disney lawyers in 3... 2... 1.

    Speaking of Disney, how long before we hear VALL-E talking like Mickey Mouse?

  • I cannot imagine a single legitimate use-case for this. Thank you, Microsoft, for making scammers' lives so much better!
    • by Bert64 ( 520050 ) <bert AT slashdot DOT firenzee DOT com> on Tuesday January 10, 2023 @09:28AM (#63195058) Homepage

      There are some...

      Dubbing movies into languages the original actors cannot speak, while retaining the voice of the original actor.
      Remaking old movies where the original actors are dead.
      Introducing flashback scenes into new movies when the original actors who played those characters look and sound significantly different now.
      Teaching people why they should be sceptical about what they hear and see in the media.

      • by HiThere ( 15173 )

        Allowing people with ALS to "speak understandably". (But it will be too expensive for that to be common. It's not just the speech synthesis, it's the customized input...which needs to keep getting altered as the disease progresses.)

      • At last, the French, Spanish and Germans can get the marvels of Mr. T in re-runs of the A-Team*.

        (I couldn't really "get with" the French dubbed A-Team - the B.A voice actor just sounded like he was putting on a "tough" voice, whereas it sounds completely natural from the T).

        * And possibly more "important" use-cases too ;-)

      • With an AI Biden, people may think he is not in a vegetative state anymore.
      • by Anonymous Coward

        You forgot number one usecase for any new technology, porn.

      • by xanthos ( 73578 )
        "Remaking old movies where the original actors are dead. Introducing flashback scenes into new movies when the original actors who played those characters look and sound significantly different now."

        Or both, as in the case of the documentary/movie Roadrunner where they used AI to read letters and emails in Anthony Bourdain's voice. I can't for the life of me imagine he would ever have approved of such f*ckery.
      • It's hard for me to imagine that it could manage a different language effectively; I get how the theory would make it possible (separate voice from language), but most fluent speakers of a language sound different in each language-- at least when you are going from a Germanic language to a Romance language. Essentially the acquired language takes on the voice of the teacher in many ways... but there is more to it.

        • "Edit"
          Scratch that. I can also think of people that it takes time (when listening in the background) for me to process what language they are speaking, going between English and Spanish or even English and Thai. I guess there would be no guarantee that a simulated voice would match how they would speak a different language.

    • by NFN_NLN ( 633283 )

      > I cannot imagine a single legitimate use-case for this.

      Now everyone can have a custom Christopher Walken voicemail message.

  • by sTERNKERN ( 1290626 ) on Tuesday January 10, 2023 @09:29AM (#63195062)
    "Hey Grandma, it's me. I am in trouble, could You please send some money?"
    • by gtall ( 79522 )

      Bot Grandma: Howdy Son, I see you'd like some Cryptocoins. I'll have them stashed away in an account for you just as soon as you contribute to my Old Age You_Fund_Me account.

    • by Pascoea ( 968200 )
      Fix the caller ID system and most of those issues go away.
  • by RemindMeLater ( 7146661 ) on Tuesday January 10, 2023 @09:41AM (#63195100)
    Seriously... "Schwab's voice ID service allows you to access your account just by speaking one simple phrase, "At Schwab, my voice is my password."" https://www.schwab.com/voice-i... [schwab.com]
    • by coofercat ( 719737 ) on Tuesday January 10, 2023 @10:19AM (#63195236) Homepage Journal

      I do wonder about that - and more simplistically, you can set up "profiles" for Amazon Alexa so it only responds to your voice. Will these be fooled?

      Actually, I suspect they won't (yet). About 30 years ago Nuance had similar "voice print" technology, which couldn't be fooled by a voice impersonator that was good enough to fool (most) humans (at least for the sentences in the demo). Apparently there are some underlying harmonics that don't change the sound to any perceptible degree that the computer can pick up. On this occasion, the "new" voice is built from the "old" one, rather than being entirely synthesised, so those secret harmonics might be present, but perhaps not constructed correctly enough. As I say, it's only a matter of time until AI can do it "perfectly", if indeed it can't already fool existing systems.

    • by AmiMoJo ( 196126 )

      My banks tried to sign me up for this a few years ago, but I declined. My banks are still idiots though, can't use U2F and their password policy enforces weakness.

  • by altp ( 108775 ) on Tuesday January 10, 2023 @09:43AM (#63195108)

    The first use case that I thought of would be for someone like Stephen Hawking, and other people who can no longer speak the way they used to. Imagine how great it would be to be able to use your own voice after a horrible accident or disease.

    We are either going to love and embrace all these new technologies, and learn how to mitigate the threats from them ... which is largely the same thing we've had to do with all new technologies, ever ...

    OR, we are going to be terrified of them and regulate them out of relevance.

    The difference is, we are at a point of rapid advancement in truly game-changing technologies, some of which we don't fully understand, and that's "scary".

    Personally, I think that these are exciting times and I truly hope that we embrace these technologies.

    • Right?
      There's lots of scary things to worry about, but most of that could be combatted by increasing international enforcement efforts against fraud...which we need to do anyway.
      This type of technology will ....oh, I really hate to use this word, but I will because this is one case where it actually belongs..."synergize" extremely well with other AI advances currently happening. GPT-3 is making it possible to auto-generate large amounts of text: text-to-voice tech obviously mates up with that perfectly.
      The

    • Re: Stephen Hawking

      Bad example: when offered an upgraded speech engine, he kindly declined, declaring that the one he used is how everyone knew him, that he identified with it.

      He did, however, get a couple upgrades to his text input system over the years.

  • Microsoft is investing in PR via /. lately...
  • by Miles_O'Toole ( 5152533 ) on Tuesday January 10, 2023 @09:50AM (#63195128)

    Who will be first to moan in ecstasy and beg for my favours...Marilyn Monroe? Miss Piggy?

  • No more voice passwords.
  • Link:
    - Live text-to-speech
    - ChatGPT
    - VALL-E

    Hook it up to a VOIP server, and bait the Amazon Refund scam or one of the many phone scams... sample their voice, and use their own voice with VALL-E and see how long you could keep them on the phone.. :O

  • How many times did the author of this article write "WALL-E" instead of "VALL-E"?

  • The Terminator got Sarah Connor's location by copying her grandmother's voice.

    Of course this implementation doesn't have to kill you first.

    • by Pascoea ( 968200 )

      Of course this implementation doesn't have to kill you first.

      Yet. This is just a pilot program. They'll get there, we just have to give them time to work the bugs out.

  • Suppose I'm building some sort of device and I want Majel Barrett's voice. If the AI can generate it, whose voice is it really? What if it's mixed with other voices?

  • The article is full of anxiety mongering and while I believe that there is significant place for caution now, 3 seconds is a hyperbolic claim.

    Simply, there's no way 3 seconds of anyone's speech is going to carry enough phonemes to 'get' that individual's particular speech patterns, particularly around the transitions between them which are a geometrically larger number.

    Is this system powerful? Yes.
    Is it impressive? Yes.
    Is it capable of learning speech patterns astonishingly quickly? Yes. (Which is where t

  • This sounds like marketing bullshit from a company that historically has not been able to find its arse with both hands and a map.

    • Odd that a company that can't find its own arse has a market cap of $1.7 TRILLION, isn't it?
      • by nagora ( 177841 )

        Odd that a company that can't find its own arse has a market cap of $1.7 TRILLION, isn't it?

        Yeah. They're still shit, though, and this is still marketing bollocks.

  • The Acid test for this technology should be 3 seconds of Christopher Walken audio. GooOod...LoUucK!
  • Can it simulate my dad saying, "I'm proud of you, son"? I'd pay a lot of money for that!
  • by nospam007 ( 722110 ) * on Tuesday January 10, 2023 @01:00PM (#63195910)

    Now law enforcement can MAKE tapes with evidence.

  • that doesn't readily lend itself to heaps of abuse at all. I'm sure there were a lot of good reasons to develop this.
  • Why did they think this was a good idea?
