
Demystifying OpenAI's STUNNING New Model Sora & Meta's World Model V-JEPA


šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Sora, OpenAI's Stunning New Video Model

  • Leaders: V-JEPA and World Models

AI Research of the Week

The feeling that things have changed. That's what almost everyone who has seen Sora felt.

Last week, we talked about Google's Lumiere model and how it was better than anything we had seen. Well, a week later, OpenAI's Sora blows Lumiere completely out of the water.

No words can describe it, so here's an example:

Source: OpenAI

So today we are unpacking everything that is known about this life-changing model and what it means for the future of many industries.

Best of both worlds

Sora is OpenAI's first text-to-video, image-to-video, and video-to-video model: a world simulator that generates the most impressive videos we have ever seen, miles ahead of anything else.

Like ChatGPT, it has been trained on Internet-scale data, but of videos and their descriptions, learning to create high-resolution videos up to one minute long.

However, the similarities with ChatGPT don't end there.

Just like LLMs treat text as tokens (groups of 3-4 characters on average), Sora treats frame patches as tokens too, calling them "visual spacetime patches".

But what does that even mean?

In a nutshell, Sora treats videos by decomposing them into two elements:

  • Frames. Just like any other video model, the easiest way to build videos is to consider them as concatenations of frames (static images), as Google's Lumiere does.

  • Visual patches. Each frame is then decomposed into groups of pixels. However, these patches are 3D, as they capture more than one frame so that we can account for time too.

Then, depending on the aspect ratio and resolution you want the output to have, those patches are arranged into that shape to generate the video.
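
To make that concrete, here is a minimal sketch, assuming the video is a NumPy array of shape (frames, height, width, channels), of how a clip could be cut into 3D spacetime patches and flattened into tokens. The patch sizes and shapes are illustrative assumptions, not OpenAI's actual values.

```python
import numpy as np

def extract_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video array (T, H, W, C) into 3D 'spacetime' patches.

    Each patch spans `pt` consecutive frames and a `ph` x `pw` pixel region,
    so it carries spatial and temporal information at once. Sizes are
    illustrative, not OpenAI's actual values.
    """
    T, H, W, C = video.shape
    # Trim so the dimensions divide evenly (a real system would pad instead).
    video = video[: T - T % pt, : H - H % ph, : W - W % pw]
    T, H, W, _ = video.shape
    patches = (
        video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)    # group the patch axes together
             .reshape(-1, pt * ph * pw * C)     # one flat token per patch
    )
    return patches  # shape: (num_patches, patch_dim)

video = np.random.rand(16, 128, 128, 3)         # 16 frames of 128x128 RGB noise
tokens = extract_spacetime_patches(video)
print(tokens.shape)                             # (256, 3072)
```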

But how does it actually generate videos?

As we have described many times, diffusion is the standard approach to synthesizing images and videos.

During training, noise is progressively added to an image, and the model has to predict how much noise was added, take it out, and reconstruct the original image based on a given text condition that describes the original image, like "a portrait of a relaxed cat".

That way, during inference, the model is given a canvas of pure noise and can create new images based on the given text description.

Think of diffusion as a similar exercise to what an artist like Michelangelo or Bernini would do with a marble block. They took out the "excess" marble to uncover the statue.
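
For intuition, here is a toy sketch of that training objective: corrupt clean latents with noise, then train a small network to predict the noise that was added. The tiny model, the linear corruption schedule, and the missing text conditioning are all simplifications; this is not Sora's actual architecture or noise schedule.

```python
import torch
import torch.nn as nn

# A toy denoiser standing in for a diffusion model (purely illustrative).
class ToyDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, noisy, t):
        # Append the noise level t so the model knows how corrupted the input is.
        return self.net(torch.cat([noisy, t[:, None]], dim=-1))

model = ToyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: corrupt clean latents, then predict how much noise was added.
clean = torch.randn(32, 64)                             # stand-in for clean image/video latents
t = torch.rand(32)                                      # random noise levels in [0, 1]
noise = torch.randn_like(clean)
noisy = (1 - t[:, None]) * clean + t[:, None] * noise   # simple linear corruption schedule
loss = ((model(noisy, t) - noise) ** 2).mean()          # "predict the added noise"
loss.backward()
optimizer.step()
```

At inference time, the same model is applied repeatedly to pure noise, removing a little of it at each step until only the generated content remains.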

So, what's the end-to-end process then?

Sora's generation process

During training, the model is given two things: a video and a text description of that video.

Because they required Internet-scale data, they trained a separate video-captioning model that took millions of videos and described them in text, rather than captioning them manually.

Then, the process follows four steps: compression, patching, diffusion, and decoding. Let's unpack each one.

Source: OpenAI

First, the video is turned into a compressed latent representation using a visual encoder.

But what is a representation?

Also known as an 'embedding', it's the representation of an input (text, image, video…) in the form of a vector of numbers. This vector is dense, meaning that it captures the semantics of the underlying concept.

The key is that vectors representing semantically-similar concepts will be similar too. This allows machines to understand relatedness by calculating the distance between these vectors.

That way, the closer two representations are, like the vectors for "cat" and "dog", the more similar the concepts are to the model. This similarity principle underpins not only Sora, but almost every frontier model today, including ChatGPT and Gemini.
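
As a quick illustration of that distance idea, relatedness between embeddings is commonly measured with cosine similarity. The vectors below are made up for the example; real models learn embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: close to 1 = very related, close to 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, hand-made for illustration only.
cat = np.array([0.9, 0.8, 0.1, 0.0])
dog = np.array([0.8, 0.9, 0.2, 0.1])
car = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(cat, dog))   # high: related concepts
print(cosine_similarity(cat, car))   # low: unrelated concepts
```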

However, this transformation has other implications too:

  • Efficiency. Sora works on videos in compressed form, making the process much cheaper.

  • Effectiveness. Working in representation space helps the model pay attention to what matters, instead of scrutinizing every pixel of the image.

However, it must be noted that this is a spacetime compression.

But what does that mean? 

Well, for starters, as Sora is a transformer, we know that it processes videos using the attention mechanism.

In very simple terms, attention is the process of making tokens talk and updating the meaning of each token based on its explicit meaning and its surroundings.

This sounds abstract, but it's quite intuitive if we think about how ChatGPT works.

If we think about the sentences "The river bank" and "I'm going to the bank to take out some money", the word "bank" has a different meaning depending on its surrounding context.

So, to figure out what "bank" means in each case, ChatGPT performs the attention mechanism so that each word pays attention to the rest.

That way, it figures out the meaning of "bank" not only by its definition, but also in the context of its surroundings.

In Sora's case, the attention mechanism is performed at the frame level, making visual patches talk with each other, and also at the sequence level, making frames 'talk' with other frames.

The former indicates what is going on in that particular frame, and the latter gives Sora an intuition of what is going on across multiple frames.

Put simply, one tells you there's a dog in one image, and the other tells you what the dog is doing in the video.
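
Here is a rough sketch of that two-level attention, assuming the latent video is a (frames, patches, dim) tensor. The learned query/key/value projections and multi-head machinery are omitted, and OpenAI has not published how Sora actually factorizes attention; this only illustrates the frame-level versus sequence-level idea described above.

```python
import torch

def attention(q, k, v):
    """Plain scaled dot-product attention: every token attends to every other token."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Toy latent video: 8 frames, 64 patches per frame, 32-dimensional patch embeddings.
frames, patches, dim = 8, 64, 32
x = torch.randn(frames, patches, dim)

# Frame-level ("spatial") attention: patches within the same frame talk to each other.
spatial = attention(x, x, x)                        # (frames, patches, dim)

# Sequence-level ("temporal") attention: each patch position talks to itself across frames.
xt = spatial.transpose(0, 1)                        # (patches, frames, dim)
temporal = attention(xt, xt, xt).transpose(0, 1)    # back to (frames, patches, dim)
print(temporal.shape)
```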

Next, this latent representation is decomposed into the visual spacetime patches we talked about earlier, creating the cube-like shapes we see in the image above and allowing Sora to generate videos at any required size, duration, and resolution.

As we just explained, these visual patches are cube-shaped because they carry information across space (frame-level) and time (across frames).

It is important to note that this token-based mechanism is a key difference between Sora and Lumiere.

Although Lumiere is also a diffusion model, it generated each image in the video all at once. Sora, in contrast, builds every image by assembling patches.

It's unclear whether generating images patch by patch is superior to Lumiere's approach of generating the entire image at once.

In other words, we can't categorically attribute Sora's amazing performance to this patching technique. In fact, the key differentiator seems to be the quality and size of the data used. The Shutterstock deal with OpenAI back in July makes much more sense now.

We can make the case, though, that it makes Sora much more flexible in terms of the size and duration of the generated videos.

Next, with the patches placed in the desired arrangement, diffusion is applied to the noisy patches to create the representation of the new video.

Finally, a decoder takes these 'clean' new patches and decodes them back into pixel space, giving you the video.
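
Putting the four steps together, here is a stubbed sketch of the generation path: start from noisy spacetime patch tokens, denoise them step by step, and decode the result back to pixels. Every function name, shape, and signature here is an assumption for illustration, not OpenAI's interface; the real denoiser would be a large diffusion transformer conditioned on the text prompt.

```python
import torch

def denoise_step(tokens, prompt):
    # Stub for one diffusion-transformer update; a real model would use the prompt here.
    return tokens * 0.98

def decode_to_pixels(tokens):
    # Stub decoder mapping clean latents back to a (frames, height, width, channels) video.
    return torch.rand(16, 128, 128, 3)

def generate_video(prompt, num_tokens=256, dim=32, steps=50):
    tokens = torch.randn(num_tokens, dim)       # start from pure noise in latent space
    for _ in range(steps):
        tokens = denoise_step(tokens, prompt)   # progressively remove the noise
    return decode_to_pixels(tokens)             # map the 'clean' patches back to pixel space

print(generate_video("a corgi surfing a wave at sunset").shape)
```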

Denoising process. Source: OpenAI

But that's not all.

Sora can also be conditioned on images and video, meaning that you can use it to animate images, perform video editing, interpolate between videos, and more.

Combining videos to create new ones. Source: OpenAI

But how does Sora allow for multiple conditioning?

They don't elaborate on this, but this isn't something new. A popular method is classifier-free diffusion guidance. Here, the model is trained to generate outputs both conditionally and unconditionally.

In other words, sometimes they tell it what to generate using a text description or an image; other times, they simply make it generate "something" with no condition at all. This way, the model learns to adapt flexibly to every situation, generalizing across several conditional generation processes.

What's more, this gives you control through a tunable parameter that determines how much the output must pay attention to one condition or the other.

For instance, take an image of Michelangelo's David conditioned on two things: the original image and the text description "Turn it into a cyborg". The more weight the text description is given, the less the output looks like the original statue and the more it looks like a cyborg, and vice versa.
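
Here is a minimal sketch of classifier-free guidance as described above: the model makes one unconditional and one conditional prediction, and a guidance scale decides how strongly the output follows the condition. The toy denoiser and its (noisy, condition) signature are assumptions for illustration only.

```python
import torch

def classifier_free_guidance(model, noisy, condition, guidance_scale):
    """Blend the unconditional and conditional predictions.

    A scale of 1.0 follows the condition loosely; larger values push the output
    harder toward the text (or image) condition.
    """
    uncond = model(noisy, None)        # "generate something", no condition
    cond = model(noisy, condition)     # generate according to the condition
    return uncond + guidance_scale * (cond - uncond)

# Toy denoiser so the sketch runs end to end (not a real model).
def toy_model(noisy, condition):
    return noisy * 0.5 if condition is None else noisy * 0.5 + condition * 0.1

noisy = torch.randn(4, 32)
text_embedding = torch.randn(4, 32)
for scale in (1.0, 4.0, 8.0):          # stronger scale = output sticks closer to the prompt
    prediction = classifier_free_guidance(toy_model, noisy, text_embedding, scale)
    print(scale, prediction.norm().item())
```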

The implications? Oh my.

Ten months ago, the state of the art in AI video generation looked like this. Now, we have AI-generated videos that require close inspection to tell whether they are real or not.

Cinema, designers, artists, production companies: no one is safe now. Yet there's still much room for improvement.

Watching some of the generated videos, you quickly realize that while Sora has learned a very good representation of the physics of the world, it still doesn't understand the world, generating really weird, implausible situations at times.

For that reason, today's Leaders segment below tackles what Meta believes is the key to actually turning Sora and other models into world models, aka the key to AGI.

But more on that below…

This Newsletter is Brought to You by AE Studio

Given the speed at which AI develops, keeping up with it is becoming unmanageable. And don't get me started on actually building AI products.

In that case, just let the experts do it for you, starting with an 80% discount from AE Studio, a company trusted by the likes of Salesforce, Berkshire Hathaway, and Walmart, among others.

Hire a world-class AI team for 80% less

Trusted by leading startups and Fortune 500 companies.

Building an AI product is hard. Engineers who understand AI are expensive and hard to find. And there's no way of telling who's legit and who's not.

That's why companies around the world trust AE Studio. We help you craft and implement the optimal AI solution for your business with our team of world-class AI experts from Harvard, Princeton, and Stanford.

Our development, design, and data science teams work closely with founders and executives to create custom software and AI solutions that get the job done for a fraction of the cost.

P.S. Wondering how OpenAI DevDay impacts your business? Let's talk!

Best news of the week

šŸ§ Googleā€™s amazing Gemini 1.5, best LLM ever?

Generative AI in a nutshell, by Henrik Kniberg

Leaders

V-JEPA, our first true world model?

In what might be considered the greatest week in AI since the launch of GPT-4 almost a year ago, a huge model has flown under the radar for many, overshadowed by the likes of OpenAI's Sora and Google's new Gemini 1.5.

But Meta's new model, V-JEPA, might be the key to solving the issues that Sora is showing today.

Neural Networks, Fascinating Yet Limited

It doesn't take much effort to see how little Sora understands the world.

A man running the wrong way on a treadmill, a glass that breaks unexpectedly, humans walking in weird ways, or even puppies that pass through each other.

These are just a handful of weird videos Sora has given us.

Don't get me wrong: Sora is incredible at generating high-resolution videos, but it is also incredibly dumb.

And even OpenAI acknowledges it.

The key to why this happens becomes apparent when you understand how these models learn.

During training, Sora has seen a huge number of videos, each with a detailed text description alongside it.

Then, by learning to reconstruct the original videos considering both the video and the text description, Sora achieves two things:

  • It learns key concepts like spatial or temporal awareness

  • It also learns a textual understanding of the videos thanks to the text description

This way, over time, Sora learns to generate new videos that are semantically related to the given text description. Thus, just like any neural network, it simply learns by trial and error.

But what makes neural networks great?

Well, beyond the obvious fact that they generate amazing outputs, the key to their greatness is that they can learn through observation, just like a human would.

In other words, insofar as the outputs of Sora are concerned, it's fairly clear that the model understands space and time.

It consistently generates the same objects through various frames, with almost no object imperfections, and understands key concepts like distance, angles, perspective, and so on.

Amazingly, it learned all these things through mere observation, not through explicit formulas.

Put simply, nobody taught Sora the laws of physics; it simply induced them by observing those motions millions of times, just as humans don't consciously compute the laws of physics every time they want to perform a movement.

That is why neural networks are considered universal function approximators, algorithms that can build incredibly complex models of our world.

In the case of ChatGPT, it has built a statistical model of human language; in the case of Sora, a model that simulates the world given a text description.

There's no way a human could write those functions with pen and paper, which is why neural networks are so great, yet so deeply misunderstood.

However, seeing the stupid mistakes it still makes, we arrive at a simple conclusion:

To this day, even after Sora's release, we can confidently say that AI still lacks the key to human intelligence: causal reasoning.

In other words:

Sora understands physics, but it doesn't understand cause and effect.

And here, my friends, is where world models come in: the key that might turn models like Sora into intelligent… "beings".

Subscribe to Leaders to read the rest.
