I-JEPA, Humanity's First Attempt at Building our First AI World Model
🏝 TheTechOasis 🏝
If LLMs are the hottest thing in AI, then world models are the holy grail.
They represent a vision of an AI that learns about our world not by brute force or rote memorization like ChatGPT, but by forming abstract representations of it, much like humans do.
In this divine narrative, I-JEPA (Image-based Joint-Embedding Predictive Architecture), built by Meta, emerges as the first tangible success in realizing this vision.
It needs ten times fewer resources and no human-crafted tricks to help machines understand the simplest of concepts about our world, offering a glimpse of a future where AI learns not by millions in GPU expenditure, but by real understanding.
Common Sense AI
Let me ask you a question: How intelligent do you think GPT-4 really is?
Well, according to Yann LeCun, Chief AI Scientist at Meta, hailed as one of the top three most important AI scientists in history, “less than a dog”.
The reason? Models like GPT-4 lack common sense.
Think about learning to drive a car. On average, a human teen takes around 20 hours of learning to do it decently.
Autonomous driving systems, on the other hand, require thousands of hours of training with billions of data points, only to have inferior driving capabilities.
And the way to solve this could be world models.
How humans see the world
According to cognitive theory, world models are abstract representations that the human brain creates from the world it lives in to help us interact and, basically, survive in this environment.
These world models are capable of predicting unforeseen events to help drive our actions and minimize the chance of harm or death.
In other words, they are hypothesized to be what we describe as “common sense”.
And what’s pretty clear is that today’s best AI models lack it.
Dogs and Heights
Comparing ChatGPT with dogs as Yann did, we can clearly understand how different the learning approaches are.
For instance, a dog knows that jumping from a third-story balcony isn’t the best idea when it comes to survival, even though that dog has never needed, and will never need, to jump from such heights to confirm it.
However, to train an AI robot, you must lure it to jump (usually done in a simulated environment) for it to understand that to preserve its integrity, it must avoid jumping from high places.
In other words, dogs learn from the world by simply observing it and avoid taking certain actions because they could cause them harm, without the need to actually experience that harm most of the time.
That’s common sense.
But then, what’s this new way of training?
Artificial World Models
When training standard AI models to classify objects in an image, AI researchers need to feed the model different views of each image.
Rotation, zooming, cropping, etc.
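To make those hand-crafted views concrete, here is a minimal sketch of what rotation, cropping, and zooming could look like on a toy grayscale image. The helper name `make_views`, the crop size, and the 2x zoom factor are my own illustrative choices, not taken from any particular training pipeline.

```python
import numpy as np

def make_views(image, crop=24):
    """Return hand-crafted views of a square (H, W) image (hypothetical helper)."""
    h, w = image.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    # Zoom: take the central half of the image and upsample it 2x (nearest neighbor).
    half = image[h // 4:h // 4 + h // 2, w // 4:w // 4 + w // 2]
    return {
        "rotated": np.rot90(image),                       # 90 degrees counterclockwise
        "cropped": image[top:top + crop, left:left + crop],  # central crop
        "zoomed": half.repeat(2, axis=0).repeat(2, axis=1),  # back to (H, W)
    }

# Toy 32x32 "image" where every pixel has a distinct value.
image = np.arange(32 * 32, dtype=float).reshape(32, 32)
views = make_views(image)
for name, view in views.items():
    print(name, view.shape)
```

Every one of these views still shows the same "dog", yet a standard model has to be shown all of them to learn that.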
But do humans need all of that to recognize a dog?
No, we simply view the image and understand there’s a dog in it, even though maybe it’s a breed we’ve never seen.
And do we need to know what color the grass where the dog stands is, or the color of the sky?
No, we’ve learned abstract representations of a dog that are sufficient for us to identify it (four legs, paws, etc.).
Machines, however, analyze every single pixel in the image to reach the conclusion that, among all those thousands of pixels, a certain number of them are grouped in a way that depicts a dog.
There’s no common sense there; it’s learning by brute force.
Naturally, this means that these models require thousands of images for them to understand what a dog is.
Next, if you want to classify breeds, you’ll need thousands of images… per breed.
And, of course, you must show dogs in different landscapes, because if you only show images of dogs in parks, it won’t identify them in an apartment, for instance.
Luckily, I-JEPA is the first model that resembles our ways of learning.
A Dog is always a Dog
With little training (just like what humans need), AI models should be able to see a dog in any possible scenario and still understand that it’s a dog.
To do this, I-JEPA has the following architecture:
In simple terms, I-JEPA only views a small part of the image and it’s trained to predict representations of other blocks in the image (depicted in colors above).
What’s more, this prediction is done in the representation space, avoiding forcing the model to reconstruct the complete image and avoiding having to learn unnecessary parts of it (like the grass).
Additionally, by learning to predict other patches of the same object correctly, the model obtains a much deeper understanding of what the dog in the image really is.
Thus, by exposing models to a partially observable view of reality, you train these models to handle uncertainty.
For instance, if you see your dog’s face peeking through the door of your bedroom, you don’t need to see the complete dog to know it’s there, as your world model handles that uncertainty for you.
That’s what Meta is trying to build with I-JEPA.
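For the curious, the core idea described above (see only a context block, predict the representations of other blocks, and measure the error in representation space rather than in pixels) can be sketched in a few lines. This is a toy illustration under loud assumptions, not Meta's implementation: the encoder and predictor are single random linear layers, both blocks share one encoder, the block positions are hard-coded, mean squared error stands in for the distance, and nothing is trained. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def patchify(image, patch):
    """Split an (H, W) image into a (rows, cols, patch*patch) grid of flattened patches."""
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    grid = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return grid.transpose(0, 2, 1, 3).reshape(rows, cols, patch * patch)

def block_mask(rows, cols, top, left, height, width):
    """Boolean (rows, cols) mask that is True inside one rectangular block of patches."""
    mask = np.zeros((rows, cols), dtype=bool)
    mask[top:top + height, left:left + width] = True
    return mask

def encode(patches, weights):
    """Toy stand-in for an encoder: one linear layer followed by tanh (an assumption)."""
    return np.tanh(patches @ weights)

def representation_loss(image, patch=4, embed_dim=8):
    patches = patchify(image, patch)
    rows, cols, patch_dim = patches.shape

    # Target block to predict, and a context block with the target region removed.
    target_mask = block_mask(rows, cols, top=1, left=1, height=3, width=3)
    context_mask = block_mask(rows, cols, top=0, left=0, height=rows, width=cols - 2)
    context_mask &= ~target_mask

    # Random, untrained weights: this only shows the data flow, not learning.
    encoder_w = rng.normal(scale=0.1, size=(patch_dim, embed_dim))
    predictor_w = rng.normal(scale=0.1, size=(embed_dim + 2, embed_dim))

    context_repr = encode(patches[context_mask], encoder_w)  # what the model sees
    target_repr = encode(patches[target_mask], encoder_w)    # what it must predict

    # Toy predictor: pooled context summary plus each target patch's grid position.
    positions = np.argwhere(target_mask) / np.array([rows, cols])
    pooled = np.tile(context_repr.mean(axis=0), (len(positions), 1))
    predicted = np.hstack([pooled, positions]) @ predictor_w

    # The error is computed between embeddings, never between pixels.
    loss = float(np.mean((predicted - target_repr) ** 2))
    return loss, predicted.shape, target_repr.shape, context_mask, target_mask

image = rng.random((32, 32))
loss, predicted_shape, target_shape, context_mask, target_mask = representation_loss(image)
print(f"loss={loss:.4f}, predicted {predicted_shape} vs target {target_shape}")
```

Notice that the prediction and the target both have 8 numbers per patch (the embedding size) instead of 16 pixel values: the model is never asked to repaint the grass, only to guess what the hidden patches mean.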
Abstraction is Intelligence
This idea of world models has grown on me.
It also helps that I-JEPA matches the state-of-the-art models in the industry (DINO, iBOT, etc.) with ten times less training, and with plenty of room to scale.
As I-JEPA has developed a much deeper understanding of what it sees, it doesn’t require millions of images and training hours to understand what it’s seeing… just like a human would.
To be honest, after seeing I-JEPA, I don’t think ChatGPT is the path to superintelligence.
But if we manage to create super world models… that’s a different story. And with I-JEPA, Meta starts leading the way.