Stable Diffusion 3.0 Debuts New Architecture To Reinvent Text-To-Image Gen AI

An anonymous reader quotes a report from VentureBeat: Stability AI is out today with an early preview of Stable Diffusion 3.0, its next-generation flagship text-to-image generative AI model. The new model aims to provide improved image quality and better performance when generating images from multi-subject prompts. It will also provide significantly better typography than prior Stable Diffusion models, enabling more accurate and consistent spelling inside generated images. Typography has been an area of weakness for Stable Diffusion in the past, and one that rivals including DALL-E 3, Ideogram, and Midjourney have also been working on in recent releases. Stability AI is building out Stable Diffusion 3.0 in multiple model sizes, ranging from 800 million to 8 billion parameters.

Stable Diffusion 3.0 isn't just a new version of a model Stability AI has already released; it's based on a new architecture. "Stable Diffusion 3 is a diffusion transformer, a new type of architecture similar to the one used in the recent OpenAI Sora model," Emad Mostaque, CEO of Stability AI, told VentureBeat. "It is the real successor to the original Stable Diffusion." [...] Stable Diffusion 3.0 takes a different approach by using diffusion transformers. "Stable Diffusion did not have a transformer before," Mostaque said.
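For readers unfamiliar with the term: a diffusion transformer swaps the U-Net's convolutional feature maps for a transformer that attends over a sequence of flattened latent-image patches, treating each patch like a token. A minimal, illustrative sketch of that patchify step (this is toy code, not from SD3 or the DiT paper):

```python
def patchify(latent, p):
    """Split an H x W x C latent grid into flattened p x p patches (tokens)."""
    h, w = len(latent), len(latent[0])
    tokens = []
    for i in range(0, h, p):            # walk the grid in p x p tiles
        for j in range(0, w, p):
            tok = []
            for di in range(p):         # flatten one tile into a single vector
                for dj in range(p):
                    tok.extend(latent[i + di][j + dj])
            tokens.append(tok)
    return tokens

# A 4x4 latent with 2 channels; each cell holds [row, col] as its "features".
latent = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
toks = patchify(latent, 2)
print(len(toks), len(toks[0]))  # 4 tokens, each 2*2*2 = 8 values
```

Each token then goes through standard transformer self-attention, which is what lets every patch condition on every other patch (and on the text prompt) at once.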

Transformers are at the foundation of much of the gen AI revolution and are widely used as the basis of text generation models, while image generation has largely been the realm of diffusion models. The research paper that details Diffusion Transformers (DiTs) explains that it is a new architecture for diffusion models, one that replaces the commonly used U-Net backbone with a transformer operating on latent image patches. The DiT approach can use compute more efficiently and can outperform other forms of diffusion image generation. The other big innovation that Stable Diffusion 3.0 benefits from is flow matching. The research paper on flow matching explains that it is a new method for training Continuous Normalizing Flows (CNFs) to model complex data distributions. According to the researchers, using Conditional Flow Matching (CFM) with optimal transport paths leads to faster training, more efficient sampling, and better performance compared to diffusion paths.
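The conditional flow matching objective described above is simple to state: sample noise, pick a random time t, interpolate along the straight (optimal-transport) path between the noise and a data sample, and regress a model toward the constant velocity of that path. A toy, pure-Python sketch of one loss evaluation (the linear `velocity_model` stands in for a real neural network; all names are illustrative, not from Stability AI's code):

```python
import random

random.seed(0)

def velocity_model(x_t, t, w):
    # Stand-in for a neural net v_theta(x_t, t): a scalar linear map here.
    return [w * xi + t for xi in x_t]

def cfm_loss(x1, w):
    """CFM loss for one data vector x1, using straight-line (OT) paths."""
    d = len(x1)
    x0 = [random.gauss(0.0, 1.0) for _ in range(d)]       # noise from the prior
    t = random.random()                                    # time in [0, 1]
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]    # point on the path
    u_t = [b - a for a, b in zip(x0, x1)]                  # target velocity x1 - x0
    v = velocity_model(x_t, t, w)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u_t)) / d

loss = cfm_loss([0.5, -1.0, 2.0], w=0.0)
print(loss >= 0.0)
```

Training would repeat this over many (noise, data, time) triples and minimize the loss by gradient descent; the straight-line paths are what the paper credits for faster training and fewer sampling steps than curved diffusion paths.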
  • by Rei ( 128717 ) on Friday February 23, 2024 @09:03AM (#64262288) Homepage

    ... came out RIGHT after they released Stable Cascade. I only got Stable Cascade working on my system (through a half-implemented plugin) just a couple days ago. Results comparing it to SDXL here [dbzer0.com]. This announcement sure takes the wind out of Stable Cascade's sails...

    • by Rei ( 128717 )

      As for SD3's architecture: *finally* we get Transformers integration! This should hopefully resolve issues like "A room with no elephants. Anything except for an elephant" giving you a room full of elephants. Or "A red box on top of a blue sphere" giving you boxes and/or spheres in whatever random combination of colours and orderings it wants. And should greatly increase the understanding that words aren't just patterns to play around with the same way you might play around with the shape of a tree, but

Only a tiny amount of public tagged content allows extraction of depth (stereoscopic and video). For something trained on tagged data cribbed from the web, it's hardly an option. They can't throw hundreds of millions of dollars at English-speaking third-world nations to tag stuff like OpenAI can.

        • by Rei ( 128717 )

          This is simply false. We (AUTOMATIC1111 users) commonly use depth models in our everyday workflows, which calculate depth from static images. They work great.

      • by JBMcB ( 73720 )

        As for SD3's architecture: *finally* we get Transformers integration! This should hopefully resolve issues like "A room with no elephants. Anything except for an elephant" giving you a room full of elephants.

Isn't that what negative prompting is for? I think that's a better solution, as English in particular can be weird to parse when dealing with negatives.

        • by Rei ( 128717 )

That's a hack for dealing with that particular case (if it works at all). The case exposes the fundamental problem: the model's lack of understanding of the prompt. The attention mechanism is just way too simple; it's more like "trying to make sure that everything the user mentions in the prompt exists in the image" rather than actually understanding the prompt.

      • Comments like this are why I still come here. I only had 4/5ths of what I needed to know. (y)

      • by Z00L00K ( 682162 )

        All I'd like is for it to know what Babylon 5 is.

    • Excellent thread, man. Congrats.
      I do have mod points, but I believe it's better in this case to reply :)

  • by bb_matt ( 5705262 ) on Friday February 23, 2024 @09:43AM (#64262374)

    https://stability.ai/news/stab... [stability.ai]

    Seems the ability to add proper links is beyond the abilities of some...

  • oboi (Score:1, Troll)

    by CEC-P ( 10248912 )
    Hey guys...guys...I heard this one can generate pics of white people and wasn't designed by someone named Jules Walter who tweets basically KKK-level tweets except about white people. (I think that's the right person. They buried it. Lots of delusional hate-monger lefties on that one)
