New 'Stable Video Diffusion' AI Model Can Animate Any Still Image (arstechnica.com)
An anonymous reader quotes a report from Ars Technica: On Tuesday, Stability AI released Stable Video Diffusion, a new free AI research tool that can turn any still image into a short video -- with mixed results. It's an open-weights preview of two AI models that use a technique called image-to-video, and it can run locally on a machine with an Nvidia GPU. [...] Right now, Stable Video Diffusion consists of two models: one that can produce image-to-video synthesis at 14 frames of length (called "SVD"), and another that generates 25 frames (called "SVD-XT"). They can operate at varying speeds from 3 to 30 frames per second, and they output short (typically 2-4 second-long) MP4 video clips at 576x1024 resolution.
In our local testing, a 14-frame generation took about 30 minutes to create on an Nvidia RTX 3060 graphics card, but users can experiment with running the models much faster on the cloud through services like Hugging Face and Replicate (some of which you may need to pay for). In our experiments, the generated animation typically keeps a portion of the scene static and adds panning and zooming effects or animates smoke or fire. People depicted in photos often do not move, although we did get one Getty image of Steve Wozniak to slightly come to life.
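For readers who want to experiment locally, the released weights can be loaded through the Hugging Face diffusers library. The snippet below is a minimal sketch based on the diffusers documentation rather than anything in the article; the checkpoint name, resolution, and fps values are assumptions, and file names are placeholders.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the SVD-XT image-to-video checkpoint in fp16 to fit consumer GPUs.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # offload idle submodules to CPU to reduce VRAM use

# The model conditions on a single still image, here resized to 1024x576.
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

# Write the generated frames out as a short MP4 clip.
export_to_video(frames, "generated.mp4", fps=7)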
Given these limitations, Stability emphasizes that the model is still early and is intended for research only. "While we eagerly update our models with the latest advancements and work to incorporate your feedback," the company writes on its website, "this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release." Notably, but perhaps unsurprisingly, the Stable Video Diffusion research paper (PDF) does not reveal the source of the models' training datasets, only saying that the research team used "a large video dataset comprising roughly 600 million samples" that they curated into the Large Video Dataset (LVD), which consists of 580 million annotated video clips that span 212 years of content in duration.
A picture of Chuck Norris set my RTX3060 on fire! (Score:5, Funny)
Only Chuck Norris decides when he wants to move.
I guess some pictures are not to be animated.
Pics or it didn't happen (Score:1)
Animated, of course.
It's a beginning (Score:3)
People these days want perfect stuff from the get-go, and if you expect to plonk an image of someone in there and get a polished video out of it, with bells and whistles, you will be sorely disappointed.
It's a hesitant beginning. Wait 10-15 years, then we'll see.
Re: (Score:3)
It's also just bad work by the author of the article.
You don't just publish raw outputs; at a minimum you want to run them through frame tweening first. Use ffmpeg's motion interpolation - don't release your article with, what, 4-frame-per-second videos? (See the sketch below.)
Also, SVD doesn't limit you to only 14 frames. Put in more than half an hour of compute time on a 3060, come on. A rented 3060 on vast.ai costs ~$0.25/hr ATM.
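A minimal sketch of the frame-interpolation step this comment suggests, calling ffmpeg's minterpolate filter from Python; the file names and the 30 fps target are placeholders, not values from the comment.

import subprocess

# Motion-compensated interpolation: take a low-frame-rate SVD clip and
# resample it to 30 fps so the motion plays back more smoothly.
subprocess.run(
    [
        "ffmpeg", "-i", "generated.mp4",
        "-vf", "minterpolate=fps=30:mi_mode=mci",
        "smoothed.mp4",
    ],
    check=True,
)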
And a new magazine was launched on that same day (Score:2)
And a new magazine was launched for the occasion; Hentaimes was not a declaration of civilization reaching the end times, that's for sure.
Can't wait for slashdot's famous.. (Score:2)
goatse image to be animated. I'm sure it'll become a new site favorite!
Don't... (Score:2)
Seems like a short step to something useful (Score:2)
If the tool converted 2D images into 3D models that could then be imported and manipulated in 3D tools, it would be quite useful.
How is this different from HeyGen? (Score:2)
This doesn't look news worthy to me.
And, yes, I read the summary.