Microsoft's VASA-1 Can Deepfake a Person With One Photo and One Audio Track (arstechnica.com) 13

Microsoft Research Asia earlier this week unveiled VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track. ArsTechnica: In the future, it could power virtual avatars that render locally and don't require video feeds -- or allow anyone with similar tools to take a photo of a person found online and make them appear to say whatever they want. "It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors," reads the abstract of the accompanying research paper titled, "VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time." It's the work of Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo.

The VASA framework (short for "Visual Affective Skills Animator") uses machine learning to analyze a static image along with a speech audio clip. It is then able to generate a realistic video with precise facial expressions, head movements, and lip-syncing to the audio. It does not clone or simulate voices (like other Microsoft research) but relies on an existing audio input that could be specially recorded or spoken for a particular purpose.
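The description above — a one-time appearance encoding from the photo, plus per-frame motion latents driven by the audio — can be sketched as a toy pipeline. This is not Microsoft's code or API (which has not been released); every name here is illustrative, and the "model" calls are stand-in arithmetic:

```python
# Toy sketch of the talking-head pipeline described in the summary.
# All function names and data shapes are hypothetical, not VASA-1's actual API.

from dataclasses import dataclass

@dataclass
class FaceLatent:
    """Per-frame motion controls, separate from the static identity."""
    expression: list   # facial-expression coefficients
    head_pose: tuple   # (yaw, pitch, roll)
    lip_params: list   # lip-sync parameters derived from the audio

def encode_photo(photo_pixels):
    # One-time step: extract a fixed appearance/identity code from the single photo.
    return {"identity": hash(bytes(photo_pixels)) & 0xFFFF}

def audio_to_motion(audio_chunk):
    # Per-frame step: map a slice of the speech audio to motion latents.
    energy = sum(abs(s) for s in audio_chunk) / max(len(audio_chunk), 1)
    return FaceLatent(expression=[energy],
                      head_pose=(0.0, 0.0, 0.0),
                      lip_params=[energy * 0.5])

def render_frame(identity, latent):
    # Decoder: combine the fixed identity with per-frame motion into one video frame.
    return {"identity": identity["identity"], "mouth_open": latent.lip_params[0]}

def animate(photo_pixels, audio, frame_len=160):
    identity = encode_photo(photo_pixels)
    chunks = [audio[i:i + frame_len] for i in range(0, len(audio), frame_len)]
    return [render_frame(identity, audio_to_motion(c)) for c in chunks]

frames = animate([0, 1, 2, 3], [0.1, -0.2, 0.3, -0.1] * 80)
```

The point of the split is the one in the summary: identity is encoded once from the photo, while the audio alone drives the frame-by-frame motion, which is why an arbitrary recorded clip can be swapped in.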


Comments Filter:
  • Opportunities (Score:5, Insightful)

    by Randseed ( 132501 ) on Friday April 19, 2024 @03:37PM (#64408696)
    Surely this will not be abused.
    • Research like this needs to be published early and often. At least then some people will question what they see when it is abused.

    • This particular code might not be, since they're not releasing it due to the risk (certainty) of abuse.

      But wow, this is so much better than I've ever seen in a video game. Combined with the ability to generate faces indistinguishable from real, NPCs are about to take a massive leap forward.

  • Just over a month ago, Google announced its VLOGGER [venturebeat.com] AI that does this. (Yes, there's a demo video in there too.)

    This is pretty cool stuff and I can't wait to see how it's employed as lossy video compression for video calls: why transmit what you can render closely enough? This should be super-additive to the concurrent research being performed on recovering from data loss (Opus 1.5 [slashdot.org] just got such a feature).

    • In "Infinite Jest" (1996), David Foster Wallace recounts a fictional history of video phones in which people eventually quit using them: users couldn't resist the temptation to increasingly embellish their appearance until it was all fictional and served no purpose.
  • Doesn't sound like something with a large enough legitimate market to make money off of. Does Microsoft have a development lab in Wuhan, too?
  • Just in time for the presidential election. This is going to be an awful six months of being assaulted with election campaign weapons of mass destruction.
  • The primary use case for AI is Fraud.

    The purveyors aren't the slightest bit coy about it.
    Real-time video generation with your words coming out of someone else's mouth, i.e., a puppet.
    "Grandma, I messed up. Now I need you to wire all your money to me in Nigeria."
    There are millions of reasons why this is bad and arguably zero use cases that aren't fraud.

    "We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."
  • Let's hope it sinks like the other Vasa [wikipedia.org]
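The "avatar as lossy video codec" idea raised in the thread above can be sanity-checked with rough numbers: transmit a small vector of motion parameters per frame and render the face on the receiving end, instead of sending pixels. All figures here are illustrative assumptions, not measurements of any real system:

```python
# Back-of-the-envelope comparison: raw video frames vs. per-frame motion latents.
# Parameter counts and resolutions are assumed, purely for illustration.

def raw_frame_bytes(width=640, height=480, bytes_per_pixel=3):
    # Uncompressed RGB frame.
    return width * height * bytes_per_pixel

def latent_frame_bytes(n_params=128, bytes_per_param=4):
    # A small float32 vector of expression/pose/lip parameters per frame.
    return n_params * bytes_per_param

def bitrate_kbps(frame_bytes, fps=30):
    return frame_bytes * fps * 8 / 1000

raw = bitrate_kbps(raw_frame_bytes())        # sending pixels
latent = bitrate_kbps(latent_frame_bytes())  # sending motion parameters only
ratio = raw / latent                         # bandwidth reduction factor
```

Even against compressed video rather than raw frames, the gap stays large, which is why the same render-don't-transmit idea pairs naturally with the packet-loss-recovery work mentioned in the comment.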
