
Stable Diffusion 3 & The Sinister Future of AI-Based Ads

In partnership with

šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Understanding the magic behind Stable Diffusion 3

  • Leaders: Video, The Next AI Interface

AI Research of the Week

This week Stability AI announced Stable Diffusion 3 (SD3), the next evolution of the most famous open-source model for image generation.

It displays amazing results in fidelity and resolution, making it, both visually and quantitatively, the best text-to-image (T2I) model in the industry today.

Also, it showcases the most impressive AI-generated image I have ever seen, where the line-rendering of a zebra is accurately represented:

More importantly, it finally solves one of the hardest problems in AI image generation, text-image alignment, by introducing several innovative design choices (some probably inspired by OpenAI's Sora) that will most likely set the standard for what's to come.

Mastering Flows

If we think about a generative model, be it ChatGPT or SD3, there is a common definition: they all learn the underlying distribution of the training data in order to sample from it.

But what does that mean?

Itā€™s all learning patterns

If we think about ChatGPT, when we say that it learns the training data distribution, we are indirectly saying that the model learns the patterns and structures of the training text.

In layman's terms, it learns how words follow each other. T2I models like SD3 are no different in that sense, except that they learn the distribution of the training images.

Simply put, they learn things like "if I see a group of pixels depicting a dog face, the rest of the body must be near".

However, here's the caveat: T2I models are mainly flow-based diffusion models, meaning they learn to perform a transformation between an original distribution and the target one (the patterns and structures of the training data).

Again, what does that mean?

Flow Generative Models

Flow models, unlike models like ChatGPT that sample the next words of a sequence directly from the data distribution, learn a 'mapping' that takes a noisy, random 'canvas' and turns it into an image that is similar to the images seen during training and conditioned on the user's request.

That way, it can draw things like "an animal playing basketball":

Source: OpenAI

But why do we need this flow process at all, instead of sampling directly from the learned distribution of images like we do with ChatGPT for text?

Well, you want your model to have some sort of randomness, or stochasticity, so that the newly generated samples are new images, not mere copies of ones that it saw during training.

This allows the model to generate several different images from one and the same prompt, like "a baboon close-up".

Source: OpenAI
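To make the conditioning and the randomness tangible, here is a toy Python sketch (the `generate` function below is a made-up stand-in for illustration, not any real API): the same prompt with two different noise draws yields two different, equally valid outputs.

```python
import torch

def generate(prompt_embedding, noise):
    """Toy stand-in for a trained flow model: it deterministically maps a starting
    noise 'canvas' plus a prompt embedding to an 'image' (here just a small tensor).
    A real model would apply many learned flow/denoising steps instead."""
    return torch.tanh(noise + prompt_embedding)

prompt = torch.ones(4)                       # pretend embedding of "a baboon close-up"
image_a = generate(prompt, torch.randn(4))   # same prompt...
image_b = generate(prompt, torch.randn(4))   # ...different starting noise
print(torch.allclose(image_a, image_b))      # almost surely False: two different "images"
```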

In SD3's case, this flow is a diffusion process, in which the model learns to predict the noise in an image and remove it.

However, here SD3 innovates quite a bit by applying Rectified Flows, where the flow between points in the two distributions follows a straight line.

As the path traveled between distributions is shorter, sampling becomes much faster, because fewer steps are needed.

In other words, images are generated much faster, which is a great improvement considering how slow models like DALL-E or Midjourney are.

I won't get into overly technical details, as they would take up the entire newsletter, but flow models are trained by parameterizing a vector field between distributions, which defines the probability path that flows any point in the noise distribution into the target distribution.

In layman's terms, think of flows as neural networks that learn to take points from one distribution (a heavily noised image) to a target distribution (an arrangement of pixels depicting a reasonable scene), and then apply this flow at inference time to generate new images.
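As a rough sketch of the underlying math (the notation is my simplification of the flow-matching formulation, not quoted from the SD3 paper), the straight-line path and its training target look like this:

```latex
% Rectified flow: a straight-line path between a data sample x_0 and pure noise \epsilon
x_t = (1 - t)\, x_0 + t\, \epsilon , \qquad t \in [0, 1]
% The network v_\theta is trained to predict the (constant) velocity of that line:
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left[ \left\lVert v_\theta(x_t, t) - (\epsilon - x_0) \right\rVert^2 \right]
% Because the path is straight, only a few integration steps are needed at sampling time.
```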

For more technical depth, I suggest taking a look at the following video, explained by Yaron Lipman from Meta.

But again, the same principle as ChatGPT applies here: generative models are AI systems that learn to generate data similar to what they have seen during training by learning its distribution, i.e., its patterns and structures.
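To make the mechanics concrete in code, here is a minimal, self-contained PyTorch sketch of a rectified-flow training step and a few-step Euler sampler. The tiny network and the 2-D toy data are placeholders I made up for illustration; the real SD3 operates on image latents with text conditioning and a transformer backbone.

```python
import torch
import torch.nn as nn

# Toy stand-in for the denoising network: a real model (e.g. SD3) would be a
# transformer over image latents, conditioned on text embeddings.
class TinyVelocityNet(nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t, t):
        # Concatenate the timestep so the network knows where along the path it is.
        return self.net(torch.cat([x_t, t], dim=-1))

model = TinyVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0):
    """One rectified-flow (flow-matching) step: predict the straight-line velocity."""
    noise = torch.randn_like(x0)                   # point in the noise distribution
    t = torch.rand(x0.shape[0], 1)                 # random position along the path
    x_t = (1 - t) * x0 + t * noise                 # straight-line interpolation
    target = noise - x0                            # constant velocity of that line
    loss = ((model(x_t, t) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(n=16, steps=8, dim=2):
    """Few-step Euler integration from noise back to data: straight paths need few steps."""
    x = torch.randn(n, dim)                        # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((n, 1), (i + 1) / steps)
        x = x - model(x, t) * (1 / steps)          # move against the predicted velocity
    return x

# Toy "data distribution": points on a circle. Different starting noise -> different samples.
data = torch.randn(256, 2)
data = data / data.norm(dim=-1, keepdim=True)
for _ in range(200):
    training_step(data)
print(sample(4))
```

The key point is the straight-line interpolation and its constant velocity target: because the learned paths are (close to) straight, the sampler above only needs a handful of steps.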

Moving on, SD3 also embraces the concept of diffusion transformers, just like Sora did, which confirms them as the new standard.

DiTs and Weight specialization

The main backbone of the new SD3 architecture is a diffusion transformer.

Attention is all you need

That is, a design that combines diffusion, the concept we just covered, with Transformers, the architecture behind models like ChatGPT or Gemini.

The reason for this is that the main component of Transformers, the attention mechanism, works as wonderfully with images as it does with text.

In both cases, it's a mechanism that allows different parts of the input data to share information so that the model builds an understanding of what it's seeing.
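As a rough illustration of why the same mechanism transfers, here is a toy PyTorch sketch (my own example, not SD3's code) that splits an image into patches and runs plain self-attention over them, exactly as one would over word tokens:

```python
import torch
import torch.nn as nn

# A 64x64 RGB image split into 8x8 patches becomes a sequence of 64 "visual tokens",
# which standard self-attention can process just like a sequence of word tokens.
image = torch.randn(1, 3, 64, 64)            # (batch, channels, height, width)
patch_size, embed_dim = 8, 128

# Patchify: cut into non-overlapping patches, then flatten each patch into a vector.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 64, 3 * patch_size * patch_size)

to_tokens = nn.Linear(3 * patch_size * patch_size, embed_dim)   # patch embedding
tokens = to_tokens(patches)                                     # (1, 64, 128)

# Plain multi-head self-attention: every patch exchanges information with every other patch.
attention = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
out, weights = attention(tokens, tokens, tokens)
print(out.shape, weights.shape)              # (1, 64, 128) and (1, 64, 64)
```

Swap an LLM's word embeddings for these patch embeddings and the attention machinery is identical.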

Still, SD3 added a twist.

Weights are not meant to be shared

When working with a model that handles both text and images (you describe what you want with text, but the generated output is an image), the parameters of the model need to handle very complex and dissimilar patterns.

In other words, text and images are wildly different data structures.

Thus, just like the brain has different areas dedicated to text and images, here the researchers defined specific weights for each.

Naturally, the two sets of weights eventually interact during attention, guaranteeing that the text and image representations share information, just like the different areas in your brain communicate with each other.
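Here is a minimal sketch of that idea in PyTorch (my own simplification of the "separate weights, joint attention" design, not the actual SD3 implementation): each modality gets its own projection weights, but queries, keys, and values are concatenated so attention runs over text and image tokens together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttention(nn.Module):
    """Separate projection weights per modality, but one joint attention operation."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.heads, self.dim = heads, dim
        # Text and image tokens each get their own query/key/value weights...
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)

    def forward(self, txt, img):
        n_txt = txt.shape[1]
        # ...but their queries, keys, and values are concatenated into one sequence,
        # so every image token can attend to every text token and vice versa.
        q, k, v = torch.cat([self.txt_qkv(txt), self.img_qkv(img)], dim=1).chunk(3, dim=-1)
        split = lambda x: x.view(x.shape[0], x.shape[1], self.heads, -1).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(q.shape[0], q.shape[1], self.dim)
        # The joint result is routed back through modality-specific output weights.
        return self.txt_out(out[:, :n_txt]), self.img_out(out[:, n_txt:])

block = TwoStreamAttention()
text_tokens = torch.randn(1, 16, 128)    # e.g. embedded prompt tokens
image_tokens = torch.randn(1, 64, 128)   # e.g. embedded latent patches
txt_out, img_out = block(text_tokens, image_tokens)
print(txt_out.shape, img_out.shape)      # torch.Size([1, 16, 128]) torch.Size([1, 64, 128])
```

The cross-modal exchange happens inside the single attention call, while every projection before and after it stays modality-specific, which is the "different brain areas that still talk to each other" picture from above.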

This gives spectacular images such as this one (notice how the marmalade doesn't blend with the water, which might imply that the model has some world knowledge).

Open source is alive and well

When compared to the current state-of-the-art models on the increasingly popular GenEval benchmark, which measures color, position, and other objective attributes, SD3 outcompetes all of them, and the same happens in human evaluations (which are far more subjective).

What's more, with a sufficiently powerful desktop computer you will soon be able to run SD3, both the 2B and 8B models, for free.

You will need at least 8GB of RAM (preferably 16GB) for the 2B model, and 24GB for the larger one.
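If the release follows the pattern of earlier Stable Diffusion versions in Hugging Face's diffusers library, running it locally could look roughly like this (the model ID and exact arguments below are my assumptions, since the weights were not public at the time of writing):

```python
# Hypothetical sketch: the model ID and some arguments are assumptions based on how
# earlier Stable Diffusion releases are exposed through the generic DiffusionPipeline API.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3",   # assumed repository name
    torch_dtype=torch.float16,          # half precision to fit consumer GPUs
).to("cuda")

image = pipe(
    prompt="a zebra rendered as clean line art",
    num_inference_steps=28,             # rectified flows should need relatively few steps
    guidance_scale=7.0,
).images[0]

image.save("sd3_sample.png")
```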

While open source may be lagging in the text field, itā€™s up there when it comes to image generation.

And we are here for it.

Best news of the week

āš”ļø Inflection releases Inflection-2.5 and matches GPT-4

This Week's Sponsor

I am always saying how convinced I am that 2024 is the year that AI goes beyond the hype to deliver useful products.

Thus, when you see a GenAI product that has more than 800,000 users and a 4.8/5 score from more than 12k reviews in Chrome alone, you know this thing is for real.

Simply put, MaxAI is a Chrome extension that allows you to chat with your browser tabs by running the best available models.

Among many features, you can:

  • Use it to write and summarize your Gmail,

  • summarize any website,

  • send it your PDFs,

  • and even chat with YouTube videos.

And best of all, you can install it for free, and they are currently offering heavy discounts of up to 40% for the next two days.

MaxAI.me - Outsmart Most People with 1-Click AI

MaxAI.me best AI features:

  • Chat with GPT-4, Claude 3, Gemini 1.5.

  • Perfect your writing anywhere.

  • Save 90% of your reading & watching time with AI summary.

  • Reply 10x faster on email & social media.

Leaders

LLMs Will Soon be the Past, Video is the Future

For almost 5 years, language models have been at the forefront of the AI industry. However, it's becoming clear we are on the verge of an unavoidable transition.

The underwhelming results offered by Claude 3, a model that only narrowly beats GPT-4 (the March 2023 version; we can't even confirm it's better than GPT-4 Turbo), signal that we might be hitting a wall.

Despite this, the worldā€™s attention seems to be concentrated on LLMs.

However, to stay ahead of the curve, one must watch what researchers are looking into as the next big thing.

And that is video.

In other words, instead of Large Language Models, or LLMs, we will soon have Large Video Models, or LVMs.

In fact, many of the recent breakthroughs in the field, which we will analyze today, signal we are reaching a tipping point that will allow the same catalysts that brought us to where we stand today to be applied to video, taking us to the next 'ChatGPT moment'.

However, as we will see today, LVMs are in a different league from LLMs, as they could give AI new superpowers like physics modeling, improved world understanding, and a joint sensory experience that will transform entire industries like robotics and take humanity to the next great milestone: embodied intelligence.

Subscribe to Leaders to read the rest.

Become a paying subscriber of Leaders to get access to this post and other subscriber-only content.


A subscription gets you:

  • High-signal deep-dives into the most advanced AI in the world in easy-to-understand language

  • Additional insights into other cutting-edge research you should be paying attention to

  • Curiosity-inducing facts and reflections to make you the most interesting person in the room