
Stable Diffusion 3.0 Debuts New Architecture To Reinvent Text-To-Image Gen AI
An anonymous reader quotes a report from VentureBeat: Stability AI is out today with an early preview of Stable Diffusion 3.0, its next-generation flagship text-to-image generative AI model. The new model aims to provide improved image quality and better performance when generating images from multi-subject prompts. It will also provide significantly better typography than prior Stable Diffusion models, enabling more accurate and consistent spelling inside generated images. Typography has been an area of weakness for Stable Diffusion in the past, and one that rivals including DALL-E 3, Ideogram, and Midjourney have also been working to improve in recent releases. Stability AI is building out Stable Diffusion 3.0 in multiple model sizes, ranging from 800M to 8B parameters.
Stable Diffusion 3.0 isn't just a new version of a model that Stability AI has already released; it's actually based on a new architecture. "Stable Diffusion 3 is a diffusion transformer, a new type of architecture similar to the one used in the recent OpenAI Sora model," Emad Mostaque, CEO of Stability AI, told VentureBeat. "It is the real successor to the original Stable Diffusion." [...] Stable Diffusion 3.0 takes a different approach by using diffusion transformers. "Stable Diffusion did not have a transformer before," Mostaque said.
Transformers are at the foundation of much of the gen AI revolution and are widely used as the basis of text-generation models, while image generation has largely been the realm of diffusion models. The research paper that details Diffusion Transformers (DiTs) explains that DiT is a new architecture for diffusion models that replaces the commonly used U-Net backbone with a transformer operating on latent image patches. The DiT approach can use compute more efficiently and can outperform other forms of diffusion image generation. The other big innovation Stable Diffusion 3.0 benefits from is flow matching. The research paper on flow matching explains that it is a new method for training Continuous Normalizing Flows (CNFs) to model complex data distributions. According to the researchers, using Conditional Flow Matching (CFM) with optimal transport paths leads to faster training, more efficient sampling, and better performance compared to diffusion paths.
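To make those two ideas concrete, here is a minimal, hypothetical PyTorch sketch, not Stability AI's actual implementation: a toy diffusion transformer that patchifies latent images into tokens instead of running a U-Net, trained with a conditional flow matching loss over straight (optimal-transport) paths. All class names, sizes, and hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy diffusion transformer: latent images are cut into patches,
    projected to tokens, and run through a plain transformer encoder
    in place of the usual U-Net backbone."""
    def __init__(self, channels=4, patch=2, dim=256, depth=4, heads=4, max_tokens=256):
        super().__init__()
        self.patch = patch
        self.to_tokens = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, dim))
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_patches = nn.Linear(dim, channels * patch * patch)

    def forward(self, x, t):
        b, c, h, w = x.shape
        tok = self.to_tokens(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tok = tok + self.pos[:, : tok.shape[1]]              # positional embedding
        tok = tok + self.time_mlp(t[:, None])[:, None]       # inject timestep
        tok = self.blocks(tok)
        out = self.to_patches(tok)                           # (B, N, c*p*p)
        p, gh, gw = self.patch, h // self.patch, w // self.patch
        out = out.reshape(b, gh, gw, p, p, c)                # un-patchify
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

def cfm_loss(model, x1):
    """Conditional flow matching with straight (optimal-transport) paths:
    x_t = (1 - t) * x0 + t * x1, and the regression target is the
    constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    tt = t[:, None, None, None]
    xt = (1 - tt) * x0 + tt * x1
    return (model(xt, t) - (x1 - x0)).pow(2).mean()

model = TinyDiT()
latents = torch.randn(8, 4, 32, 32)   # stand-in for VAE latents
loss = cfm_loss(model, latents)
loss.backward()

The straight-path parameterization is part of why sampling is efficient: the learned velocity field follows near-linear trajectories that can be integrated with comparatively few steps.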
It's funny because this... (Score:3)
... came out RIGHT after they released Stable Cascade. I only got Stable Cascade working on my system (through a half-implemented plugin) just a couple days ago. Results comparing it to SDXL here [dbzer0.com]. This announcement sure takes the wind out of Stable Cascade's sails...
Re: (Score:3)
As for SD3's architecture: *finally* we get Transformers integration! This should hopefully resolve issues like "A room with no elephants. Anything except for an elephant" giving you a room full of elephants. Or "A red box on top of a blue sphere" giving you boxes and/or spheres in whatever random combination of colours and orderings it wants. And should greatly increase the understanding that words aren't just patterns to play around with the same way you might play around with the shape of a tree, but
Re: (Score:2)
Only a tiny amount of public tagged content allows extraction of depth (stereoscopic pairs and video). For something trained on tagged data cribbed from the web, it's hardly an option. They can't throw hundreds of millions at English-speaking third-world nations to tag stuff like OpenAI can.
Re: (Score:3)
This is simply false. We (AUTOMATIC1111 users) commonly use depth models, which calculate depth from static images, in our everyday workflows. They work great.
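For the curious: monocular depth estimation from a single image is readily available via torch.hub. This sketch follows the publicly documented intel-isl/MiDaS entry points; the input file name is a placeholder.

import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")   # small monocular depth model
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

img = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img))          # relative inverse-depth map, (1, H', W')
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                           # resized back to the input resolution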
Re: (Score:3)
As for SD3's architecture: *finally* we get Transformers integration! This should hopefully resolve issues like "A room with no elephants. Anything except for an elephant" giving you a room full of elephants.
Isn't that what negative prompting is for? I think that's a better solution as English, in particular, can be weird to parse when dealing with negatives.
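For anyone unfamiliar with the mechanics: negative prompting is typically implemented through classifier-free guidance, where the negative prompt's embedding stands in for the usual empty-string unconditional embedding. A minimal sketch of the idea; unet, cond_emb, and neg_emb are placeholders for whatever pipeline you use, and 7.5 is just a common default guidance scale.

import torch

def cfg_step(unet, x_t, t, cond_emb, neg_emb, scale=7.5):
    # Noise prediction conditioned on the negative prompt replaces
    # the unconditional prediction...
    eps_neg = unet(x_t, t, neg_emb)
    eps_pos = unet(x_t, t, cond_emb)
    # ...so guidance pushes the sample toward the positive prompt
    # and away from the negative one.
    return eps_neg + scale * (eps_pos - eps_neg)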
Re: (Score:1)
That's a hack for dealing with that particular case (if it works at all). The case exposes a fundamental problem: the model's lack of understanding of the prompt. The attention mechanism is just way too simple; it's more like "just try to make sure that all the things the user mentions in the prompt exist in the image" than an attempt to actually understand the prompt.
Re: (Score:2)
Comments like this are why I still come here. I only had 4/5ths of what I needed to know. (y)
Re: (Score:2)
All I'd like is for it to know what Babylon 5 is.
Re: (Score:2)
It's not like Transformers are new, or hard to implement. SD 1.5 was released in the summer of 2022. It took a year and two-thirds after their first release to add Transformers. Since then we've gone through 2.0, 2.1, XL, and Cascade, none of which have included Transformers.
I'm not mad at them or anything. I'm just saying that it's about time it happened, and it's good to see it finally arrive. Other models like DALL-E have already done this.
Re: (Score:2)
Excellent thread, man. Congrats. :)
I do have mod points, but I believe it's better in this case to reply.
Proper link to source (Score:3)
https://stability.ai/news/stab... [stability.ai]
Seems the ability to add proper links is beyond the abilities of some...