Microsoft's New AI Can Simulate Anyone's Voice With 3 Seconds of Audio (arstechnica.com) 71
An anonymous reader quotes a report from Ars Technica: On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything -- and do it in a way that attempts to preserve the speaker's emotional tone. Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.
Microsoft calls VALL-E a "neural codec language model," and it builds off of a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components (called "tokens") thanks to EnCodec, and uses training data to match what it "knows" about how that voice would sound if it spoke other phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper (PDF): "To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."
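To make the "discrete tokens" idea concrete, here is a toy sketch in Python. It is not EnCodec or VALL-E: the codebook values, the frame size, and the function names are all invented for illustration, and a real neural codec learns its codebooks from data. The sketch only shows how a waveform can be turned into a sequence of integer codes and approximately rebuilt from them.

```python
import math

# Toy codebook of 2-sample frames. A real neural codec learns its codebooks
# from data; these values are invented for illustration.
CODEBOOK = [
    (0.0, 0.0), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5),
    (-0.5, 0.5), (1.0, 1.0), (-1.0, -1.0),
]
FRAME = 2  # samples per frame (assumed)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(samples):
    """Replace each frame with the index of its nearest codebook entry."""
    tokens = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[i:i + FRAME]
        nearest = min(range(len(CODEBOOK)),
                      key=lambda k: sq_dist(frame, CODEBOOK[k]))
        tokens.append(nearest)
    return tokens


def decode(tokens):
    """Rebuild an approximate waveform by looking each token up in the codebook."""
    samples = []
    for t in tokens:
        samples.extend(CODEBOOK[t])
    return samples


# One cycle of a sine wave stands in for recorded audio.
wave = [math.sin(2 * math.pi * n / 16) for n in range(16)]
tokens = encode(wave)      # a short list of small integers
rebuilt = decode(tokens)   # lossy approximation of `wave`
print(tokens)
```

The point of the toy is that once audio is a list of integers, a language model can predict the next integer the same way it predicts the next word, which is roughly the framing the summary describes.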
[...] While using VALL-E to generate those results, the researchers only fed the three-second "Speaker Prompt" sample and a text string (what they wanted the voice to say) into VALL-E. So compare the "Ground Truth" sample to the "VALL-E" sample. In some cases, the two samples are very close. Some VALL-E results seem computer-generated, but others could potentially be mistaken for a human's speech, which is the goal of the model. In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the "acoustic environment" of the sample audio. For example, if the sample came from a telephone call, the audio output will simulate the acoustic and frequency properties of a telephone call in its synthesized output (that's a fancy way of saying it will sound like a telephone call, too). And Microsoft's samples (in the "Synthesis of Diversity" section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process. Microsoft has not provided VALL-E code for others to experiment with, likely to avoid fueling misinformation and deception.
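The effect of the random seed can be illustrated with a minimal Python sketch. This is not VALL-E's sampler: the four-token vocabulary and its probabilities are made up, and `sample_tokens` is a hypothetical helper. It only shows that sampling from a fixed distribution is repeatable for one seed and generally different for another.

```python
import random

# Hypothetical next-token distribution over a tiny vocabulary of codec tokens.
VOCAB = [0, 1, 2, 3]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def sample_tokens(seed, length=20):
    """Draw `length` tokens from the fixed distribution using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=WEIGHTS, k=length)


first = sample_tokens(seed=1)
again = sample_tokens(seed=1)   # same seed: identical sequence
other = sample_tokens(seed=2)   # different seed: usually a different sequence

print(first == again)
print(first)
print(other)
```

In a model that decodes tokens to audio, different token sequences for the same text and speaker prompt would come out as different renditions of the same sentence.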
In conclusion, the researchers write: "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."
I for one welcome ... (Score:1, Funny)
...the new Bruce Willis movies where he actually talks.
Re: (Score:2)
> These are thinly veiled ads, you know...
That's okay. Or would be, if there was actually a commercial product being discussed by TFS.
My wife passed away last year. I'd be willing to pay a fee to have a good approximation of her voice in my various bits of speech-generating tech.
Sure is more appealing to me than some famous actor's garglings, I have to tell ya.
Re: (Score:2)
In every language sounding the same
Selling the vaccine and the disease. (Score:1)
"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."
Selling the vaccine along with the disease? How very quaint of Greed in our modern pandemic-riddled times.
I wonder if it will be "possible" to hire this 'detection model' for less than $10K/hour when you're falsely accused of a crime and need the fucking thing to prove your innocence...not that obvious risks aren't obvious or anything...
Re: (Score:2)
I can't think of a better way to silence any independent voice and citizen journalism than to create tools to flood the world with fake images, audio clips, and documents that, as far as the average person can tell, are indistinguishable from real recordings and individuals' writing styles.
Next you make sure tools to actually authenticate media as real vs generated are complex and expensive so that only a handful of big-tech and media gatekeepers can reliably do so - if they chose to do so - and when they chos
Re: (Score:2)
While I agree about the useful idiots defending big tech, I am more optimistic about flooding the world with fake everything -- simply because when everyone has the same tools, the people combined are a greater force than any organization. This is why media propaganda doesn't work today: people simply choose to hear what they prefer to hear anyway. (I used to say, people don't hate Trump because they watch CNN -- they watch CNN because they hate Trump. But Trump is now gone, and CNN is moving towards the mid
Re: (Score:2)
I appreciate your optimism, it's kind of refreshing. I just don't see our system working in that environment. Sure, all the information sources can cancel each other out, but the impact of that will be that you can't find verifiable, reliable information on anything you can't directly observe.
Which means we really won't be able to have a national conversation about anything. Truth is maybe we should not, but then we probably also need to do a serious rethink about national vs state and local power. I'd really
Re: (Score:2)
Really ?!? (Score:2)
Re: Really ?!? (Score:2)
Re: (Score:3)
One thing that I found when messing with TortoiseTTS to mimic voices (example [youtube.com]) was that the more "conventional" the speaker sounded (native English speaker, no accents, etc), the better of a job it did, but the more of an accent or other weirdness, the more difficulty it had. For example, I tried to set up a mock phone call between Zelenskyy and Putin, and it was a total fail.
Re: (Score:2)
(Alex Jones [twitter.com] was easier than Trump, as Trump talks kind of weird.)
VALL-E? (Score:2)
Cue the Disney lawyers in 3... 2... 1.
Speaking of Disney, how long before we hear VALL-E talking like Mickey Mouse?
Legitimate use cases? (Score:2, Insightful)
Re:Legitimate use cases? (Score:5, Insightful)
There are some...
Dubbing movies into languages the original actors cannot speak, while retaining the voice of the original actor.
Remaking old movies where the original actors are dead.
Introducing flashback scenes into new movies when the original actors who played those characters look and sound significantly different now.
Teaching people why they should be sceptical about what they hear and see in the media.
Re: (Score:2, Troll)
"Doesn't help when a Government is also saying Trust Us while engaging in absolute corrupt fuckery."
Ya, because government is one giant wad and all parts are coordinated with the others. Grow up, the world isn't simple black and white.
Re: (Score:1)
> "Doesn't help when a Government is also saying Trust Us while engaging in absolute corrupt fuckery."
> Ya, because government is one giant wad and all parts are coordinated with the others. Grow up, the world isn't simple black and white.
The Donor Class, runs the world. Grow the fuck up and realize the obvious, and put down your fucking political pom-poms.
All parts are connected in Government, tied to Greed. It's literally the fucking definition of it.
Re: (Score:2)
The more widespread and well known this kind of technology becomes, the more people will learn not to trust everything they hear.
At least it should sow doubt, and cause people to question.
Re: (Score:2)
> Good luck with that. People believe everything they read and hear online because Hype and Bullshit peddlers tell them to.
I would say the opposite. People disbelieve everything that doesn't fit their preconceived conclusions, and will find any pretext, no matter how flimsy, to dismiss it. This tech will just give them one more excuse to say "I don't believe it, it's all fake".
To quote Paul Simon: "...a man hears what he wants to hear and disregards the rest."
Re: (Score:2)
People believe things they want to believe, and they always have.
They certainly don't believe everything, online or not.
Belief (Score:2)
> Teaching people why they should be sceptical about what they hear and see in the media.
> Good luck with that. People believe everything they read and hear online because Hype and Bullshit peddlers tell them to.
No, after several years of high-profile "leaders" spouting "fake news!" at every story they didn't like, many people have been conditioned to NOT believe what they read and hear online, unless it aligns with their predisposed beliefs.
Re: (Score:2)
Allowing people with ALS to "speak understandably". (But it will be too expensive for that to be common. It's not just the speech synthesis, it's the customized input...which needs to keep getting altered as the disease progresses.)
Re: (Score:3)
At last, the French, Spanish and Germans can get the marvels of Mr. T in re-runs of the A-Team*.
(I couldn't really "get with" the French dubbed A-Team - the B.A voice actor just sounded like he was putting on a "tough" voice, whereas it sounds completely natural from the T).
* And possibly more "important" use-cases too ;-)
Re: Legitimate use cases? (Score:1)
Re: (Score:2)
Re: (Score:1)
You forgot the number one use case for any new technology: porn.
Re: (Score:2)
Re: (Score:2)
Or both, as in the case of the documentary/movie Roadrunner, where they used AI to read letters and emails in Anthony Bourdain's voice. I can't for the life of me imagine he would have approved of such f*ckery.
Re: (Score:2)
It's hard for me to imagine that it could manage a different language effectively; I get how the theory would make it possible (separate voice from language), but most fluent speakers of a language sound different in each language-- at least when you are going from a Germanic language to a Romance language. Essentially the acquired language takes on the voice of the teacher in many ways... but there is more to it.
Re: (Score:2)
"Edit"
Scratch that. I can also think of people that it takes time (when listening in the background) for me to process what language they are speaking, going between English and Spanish or even English and Thai. I guess there would be no guarantee that a simulated voice would match how they would speak a different language.
Re: (Score:2)
> I cannot image a single legitimate use-case for this.
Now everyone can have a custom Christopher Walken voicemail message.
Phishing will be so much easier (Score:4, Insightful)
Re: (Score:3)
Bot Grandma: Howdy Son, I see you'd like some Cryptocoins. I'll have them stashed away in an account for you just as soon as you contribute to my Old Age You_Fund_Me account.
Re: (Score:2)
And banks want voice as auth (Score:5, Interesting)
Re:And banks want voice as auth (Score:4, Interesting)
I do wonder about that - and more simplistically, you can set up "profiles" for Amazon Alexa so it only responds to your voice. Will these be fooled?
Actually, I suspect they won't (yet). About 30 years ago Nuance had similar "voice print" technology, which couldn't be fooled by a voice impersonator who was good enough to fool (most) humans (at least for the sentences in the demo). Apparently there are some underlying harmonics that the computer can pick up even though they don't change the sound to any perceptible degree. In this case, the "new" voice is built from the "old" one, rather than being entirely synthesised, so those secret harmonics might be present, but perhaps not reconstructed accurately enough. As I say, it's only a matter of time until AI can do it "perfectly", if indeed it can't already fool existing systems.
Re: (Score:2)
My banks tried to sign me up for this a few years ago, but I declined. My banks are still idiots though, can't use U2F and their password policy enforces weakness.
We are at a cross roads ... (Score:3, Insightful)
The first use case that I thought of would be for someone like Stephen Hawking, and other people who can no longer speak the way they used to. Imagine how great it would be to be able to use your own voice after a horrible accident or disease.
We are either going to love and embrace all these new technologies, and learn how to mitigate the threats from them ... which is largely the same thing we've had to do with all new technologies, ever ...
OR, we are going to be terrified of them and regulate them out of relevance.
The difference is, we are at a point of rapid advancement in truly game-changing technologies, some of which we don't fully understand, and that's "scary".
Personally, I think that these are exciting times and I truly hope that we embrace these technologies.
Re: (Score:2)
Right? ....oh, I really hate to use this word, but I will because this is one case where it actually belongs..."synergize" extremely well with other AI advances currently happening. GPT-3 is making it possible to auto-generate large amounts of text: text-to-voice tech obviously mates up with that perfectly.
There's lots of scary things to worry about, but most of that could be combatted by increasing international enforcement efforts against fraud...which we need to do anyway.
This type of technology will
The
Re: We are at a cross roads ... (Score:2)
Re: Stephen Hawking
Bad example: when offered an upgraded speech engine, he kindly declined, declaring that the one he used is how everyone knew him, that he identified with it.
He did, however, get a couple upgrades to his text input system over the years.
Another slashvertisement (Score:2)
This is going to be fun! (Score:5, Funny)
Who will be first to moan in ecstasy and beg for my favours...Marilyn Monroe? Miss Piggy?
Re: (Score:2)
Banks with voice passwords (Score:2)
bell canada may still have voice passwords (Score:2)
Bell Canada may still have voice passwords.
Programming challenge (Score:1)
Link:
- Live Text to speech
- ChatGPT
- VALL-E
Hook it up to a VOIP server, and bait the Amazon Refund scam or one of the many phone scams... sample their voice, and use their own voice with VALL-E and see how long you could keep them on the phone.. :O
Re: (Score:1)
Ugh.. speech to text first.. feed it into ChatGPT, then output the result through VALL-E
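The ordering described in this reply (speech-to-text, then a chat model, then speech synthesis) can be sketched in Python as a chain of three stages. Everything here is a placeholder: `fake_stt`, `fake_chat`, and `fake_tts` are hypothetical stand-ins, not calls to any real speech, ChatGPT, or VALL-E API (Microsoft has not released VALL-E code). The sketch only shows how the stages would plug together.

```python
from typing import Callable


def make_pipeline(
    speech_to_text: Callable[[bytes], str],
    chat: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain the three stages: audio in -> text -> reply text -> audio out."""
    def respond(audio_in: bytes) -> bytes:
        heard = speech_to_text(audio_in)   # stage 1: transcribe the caller
        reply = chat(heard)                # stage 2: generate a text reply
        return text_to_speech(reply)       # stage 3: synthesize the reply
    return respond


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(text: str) -> str:
    return "You said: " + text


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


respond = make_pipeline(fake_stt, fake_chat, fake_tts)
print(respond(b"hello"))
```

Passing the stages in as functions keeps the chain independent of any particular service, so each placeholder could be swapped for a real component without changing `make_pipeline`.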
Seriously (Score:2)
How many times did the author of this article write "WALL-E" instead of "VALL-E"?
Re: (Score:2)
Plagiarism from the first Terminator movie (Score:2)
The Terminator got Sarah Connor's location by copying her grandmother's voice.
Of course this implementation doesn't have to kill you first.
Re: (Score:2)
> Of course this implementation doesn't have to kill you first.
Yet. This is just a pilot program. They'll get there, we just have to give them time to work the bugs out.
What are the legal ramifications? (Score:2)
Suppose I'm building some sort of device and I want Majel Barrett's voice. If the AI can generate it, whose voice is it really? What if it's mixed with other voices?
maybe it can, but not well (Score:2)
The article is full of anxiety mongering and while I believe that there is significant place for caution now, 3 seconds is a hyperbolic claim.
Simply, there's no way 3 seconds of anyone's speech is going to carry enough phonemes to 'get' that individual's particular speech patterns, particularly around the transitions between them which are a geometrically larger number.
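The counting argument in the paragraph above can be put into rough numbers with a few lines of Python. The figures are assumptions, not taken from the article or the paper: about 44 phonemes is a commonly cited size for the English inventory (it varies by accent), and 12 phonemes per second is an assumed conversational speaking rate. Note that the number of ordered phoneme pairs grows quadratically with the inventory size.

```python
# Assumed figures, not taken from the article or the VALL-E paper.
PHONEMES = 44             # rough size of the English phoneme inventory
PHONEMES_PER_SECOND = 12  # assumed conversational speaking rate
SAMPLE_SECONDS = 3        # length of the enrolled recording

# Every ordered pair of phonemes is a possible transition.
ordered_pairs = PHONEMES * PHONEMES

# A sample of n phonemes contains at most n - 1 transitions.
phonemes_heard = PHONEMES_PER_SECOND * SAMPLE_SECONDS
max_transitions = phonemes_heard - 1

coverage = max_transitions / ordered_pairs
print(f"possible transitions: {ordered_pairs}")
print(f"transitions in sample: at most {max_transitions}")
print(f"coverage: at most {coverage:.1%}")
```

Under these assumptions the sample covers under 2% of possible transitions. That supports the point about how little of a speaker a three-second clip contains, though the summary says the model leans on its training data to fill in how the voice would sound on unseen phrases.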
Is this system powerful? Yes.
Is it impressive? Yes.
Is it capable of learning speech patterns astonishingly quickly? Yes. (Which is where t
Bet it can't (Score:2)
This sounds like marketing bullshit from a company that historically has not been able to find its arse with both hands and a map.
Re: (Score:2)
Re: (Score:2)
> Odd that a company that can't find its own arse has a market cap of $1.7 TRILLION, isn't it?
Yeah. They're still shit, though, and this is still marketing bollocks.
Acid Test (Score:2)
Re: (Score:2)
Great! (Score:2)
Nice! (Score:3)
Now law enforcement can MAKE tapes with evidence.
This is such a valuable technology (Score:2)
why? (Score:2)