FAn, the Model that Tracks Anything
🏝 TheTechOasis 🏝
Imagine it’s 2030.
You’re walking your dog late at night. The atmosphere is dead quiet, but you hear a buzzing sound somewhere around you.
You feel weird… as if you were being watched. As if something were following you.
You try to hide from open spaces, but every time you come into the open, you hear it again. Eventually, you can even see it.
It’s a drone, and it’s following you. And there’s nothing you can do about it.
Source: Author with MidJourney
Seems like yet another CyberPunk story, right?
But, to your surprise, that technology already exists, and you’re going to see it for yourself.
The ultimate tracker
Two of the most prestigious universities in the world, MIT and Harvard, have collaborated to create FAn, the “Follow Anything model”.
FAn is an open-vocabulary-any-object tracker.
In other words, it’s capable of tracking any object you desire, as long as it’s in its camera view.
It’s also multimodal, meaning that you can define the object to track with words, like “a ball”, by clicking on the object in the frame, or even by drawing a bounding box around it.
That’s all you need, and FAn will track it with insane accuracy.
Even in cases where the object temporarily disappears, FAn is capable of redetecting the object and continuing to track it.
The applications are as life-changing for some as they are dangerous for others.
But what’s the intuition behind it?
The world of feature spaces
One of the most important, yet most misunderstood, ideas in AI is that of representations.
You see, to make AI work, data needs to be in numerical format, as computers only understand numbers. For this, we use vector embeddings.
These are numerical representations of the data you give the model, be that text, images, spectrograms, you name it.
For instance, the word ‘dog’ will take a form such as:
dog = [0.02, -0.5, 0.34,...];
In order to understand the patterns and abstract meaning behind data, AI models create this compressed, n-dimensional numerical space (called a feature space or embedding space in AI literature) that has one principal law:
Similar concepts in real life should have similar vectors in this space. Dissimilar concepts should have different vectors.
# Domestic mammals
dog = [0.02, -0.5, 0.34,...];
cat = [0.03, -0.43, 0.32,...];
# Static objects
door = [-0.9, 0.04, -0.0125,...];
This is how machines understand meaning and context. ChatGPT has such representations, and so does MidJourney.
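To make the “similar concepts, similar vectors” law concrete, here is a minimal Python sketch that compares the toy vectors above with cosine similarity, one common way to measure closeness in an embedding space. The three-number vectors are just the truncated illustrations from this article, not real model outputs, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (real embeddings have hundreds of dimensions).
dog = [0.02, -0.5, 0.34]
cat = [0.03, -0.43, 0.32]
door = [-0.9, 0.04, -0.0125]

dog_vs_cat = cosine_similarity(dog, cat)
dog_vs_door = cosine_similarity(dog, door)
print(f"dog vs cat:  {dog_vs_cat:.3f}")
print(f"dog vs door: {dog_vs_door:.3f}")
```

A score close to 1 means two vectors point in nearly the same direction. With these toy numbers, dog and cat land very close to 1, while dog and door land near 0.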
Additionally, one of the most dynamic fields of AI research is making these spaces ‘multimodal’, or capable of placing together similar concepts in different formats:
# Domestic mammals
"a Siberian Husky" = [0.01, -0.55, 0.34,...];
Siberian_husky.img = [0.01, -0.55, 0.34,...];
"a wild cat" = [0.03, -0.34, 0.29,...];
# A dissimilar object
white_door.img = [-0.9, 0.04, -0.0125,...];
"a white door" = [-0.9, 0.04, -0.0125,...];
Now, FAn takes these concepts and builds them into the best zero-shot autonomous tracker.
But how does it work?
Modular architectures are here to stay
As I’ve said time and time again, AI is top of mind today thanks to the concept of foundation models like GPT powering solutions like ChatGPT.
Previously, models were trained for very domain-specific use cases and classes. For instance, FAn would have been built to detect a particular object, and would work decently only in very constrained environments.
Thanks to foundation models, FAn can track any object you want, in any domain, and in real time, because these models are capable of generalizing to classes of data they haven’t necessarily seen before (that’s what defines a foundation model).
Incredibly, FAn packs two foundation models in the same solution.
Let’s view the complete stack:
A physical tracker (a drone in the experiments) that includes a camera
A ground system that communicates with the drone, receives the video frames, and outputs the movements the drone has to make
SAM: Meta’s segmentation model, which receives an image and segments it into objects
DINO/CLIP: Using one or the other, this foundation model extracts the most relevant features of the image
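The article does not describe how the ground system turns a tracked object into drone movements, so the following is only a hedged sketch of one simple possibility: steer so that the centre of the tracked mask drifts back toward the centre of the frame. The functions `mask_centroid` and `steering_command`, the `gain` parameter, and the proportional rule itself are all assumptions made for illustration, not the researchers’ controller.

```python
def mask_centroid(mask):
    """Return the (row, col) centre of all 1-pixels in a binary mask, or None if empty."""
    row_sum = col_sum = count = 0
    for r, line in enumerate(mask):
        for c, value in enumerate(line):
            if value:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None
    return row_sum / count, col_sum / count

def steering_command(mask, gain=1.0):
    """Hypothetical proportional rule: move toward the object's offset from frame centre.

    Returns (right, down), each roughly in [-1, 1]; (0, 0) means hover.
    """
    centroid = mask_centroid(mask)
    if centroid is None:
        return 0.0, 0.0  # nothing to follow
    height, width = len(mask), len(mask[0])
    centre_r, centre_c = (height - 1) / 2, (width - 1) / 2
    # Normalised offsets: positive means the object is below / right of centre.
    down = (centroid[0] - centre_r) / (height / 2)
    right = (centroid[1] - centre_c) / (width / 2)
    return gain * right, gain * down

# A 4x6 frame where the tracked object sits in the top-right corner.
tracked_mask = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
right, down = steering_command(tracked_mask)
print(right, down)
```

Here the object is to the right of and above the frame centre, so the sketch returns a positive `right` value and a negative `down` value. A real system would also need to handle altitude, distance, and safety limits.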
And how does it all fit together?
Similar things stay together
Although bringing all this to fruition required some of the brightest minds from Harvard and MIT, the procedure is actually quite understandable.
For every frame the camera attached to the drone captures, FAn first segments the image using SAM, separating the frame into masks that refer to different objects.
Then, for every pixel in the frame, DINO or CLIP (the researchers eventually settled on DINO) calculates a vector embedding. This vector captures the semantic meaning of the pixel.
The model then groups the DINO vectors of all the pixels that fall inside the same masked object detected by SAM. This way, every identified object in the image is assigned one unique vector embedding.
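The grouping step can be sketched in a few lines of Python. The article only says the pixel vectors inside a mask are grouped into one embedding; this sketch assumes the grouping is a plain average, which is one natural choice. The function name `mask_embeddings`, the tiny 2x3 frame, and the 2-dimensional features are all made up for illustration.

```python
def mask_embeddings(pixel_features, masks):
    """Average the per-pixel feature vectors that fall inside each mask.

    pixel_features: H x W grid, each cell a feature vector (list of floats).
    masks: dict mapping a mask name to an H x W grid of 0/1 values.
    Returns a dict mapping each non-empty mask name to one pooled vector.
    """
    pooled = {}
    for name, mask in masks.items():
        selected = [
            pixel_features[r][c]
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        ]
        if not selected:
            continue  # skip empty masks
        dim = len(selected[0])
        pooled[name] = [
            sum(vec[i] for vec in selected) / len(selected) for i in range(dim)
        ]
    return pooled

# Toy frame: 2 rows x 3 columns, one 2-dimensional feature per pixel.
pixel_features = [
    [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]],
]
# Two toy segmentation masks covering different pixels.
masks = {
    "car": [[1, 1, 0], [1, 0, 0]],
    "road": [[0, 0, 1], [0, 1, 1]],
}
embeddings = mask_embeddings(pixel_features, masks)
print(embeddings)
```

Each mask ends up with a single vector summarizing its pixels, which is what makes it cheap to compare whole objects against a query.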
Next, FAn takes the vector embedding of the query (a text, click, or bounding box the user sends specifying what object to track) and computes its cosine similarity with the vector embeddings of the different masks in the image frame, which we will call mask embeddings.
As the query embedding and the mask embeddings both capture the semantic meaning of their underlying data, the mask whose embedding scores highest against the query embedding is selected as the object to track.
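As a rough sketch of this matching step, the Python below scores every mask embedding against the query embedding and keeps the best one. The vectors reuse the toy numbers from earlier in this article, and the names `best_matching_mask`, `dog_mask`, `cat_mask`, and `door_mask` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matching_mask(query_embedding, mask_embeddings):
    """Return (name of the highest-scoring mask, dict of all scores)."""
    scores = {
        name: cosine_similarity(query_embedding, embedding)
        for name, embedding in mask_embeddings.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores

# Toy query embedding, e.g. for the text "a Siberian Husky".
query = [0.01, -0.55, 0.34]
# Toy mask embeddings for three objects found in the current frame.
frame_masks = {
    "dog_mask": [0.02, -0.5, 0.34],
    "cat_mask": [0.03, -0.43, 0.32],
    "door_mask": [-0.9, 0.04, -0.0125],
}
selected, scores = best_matching_mask(query, frame_masks)
print(selected)
```

With these toy values the dog mask wins, though the cat mask scores almost as high, which hints at why semantically similar objects can be hard to tell apart.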
And all this takes you to videos such as these:
Autonomous drone tracking a car. Source: MIT/Harvard
Additionally, to redetect temporarily hidden objects, the model keeps in memory the representation of the assigned mask over a configurable number of steps. This means the model resumes tracking the moment a mask with an embedding similar to the stored one ‘reappears’.
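One way to picture this memory is a small buffer of recent embeddings that candidate masks are checked against. The sketch below is an illustration under assumptions: the class `RedetectionMemory`, its `max_steps` and `threshold` parameters, and the 0.9 cut-off are all hypothetical, and the toy vectors come from earlier in this article.

```python
import math
from collections import deque

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class RedetectionMemory:
    """Keeps the last `max_steps` embeddings of the tracked mask."""

    def __init__(self, max_steps=5, threshold=0.9):
        self.history = deque(maxlen=max_steps)  # oldest entries drop out
        self.threshold = threshold  # assumed minimum similarity to resume

    def remember(self, embedding):
        self.history.append(embedding)

    def redetect(self, candidate_embeddings):
        """Return the candidate best matching any stored embedding, or None."""
        best_name, best_score = None, self.threshold
        for name, embedding in candidate_embeddings.items():
            score = max(
                (cosine_similarity(embedding, stored) for stored in self.history),
                default=-1.0,
            )
            if score >= best_score:
                best_name, best_score = name, score
        return best_name

memory = RedetectionMemory(max_steps=3)
memory.remember([0.02, -0.5, 0.34])  # the tracked object, while visible

# Object hidden: only an unrelated mask is in view.
while_hidden = memory.redetect({"door": [-0.9, 0.04, -0.0125]})
# Object back in view with a slightly different embedding.
after_return = memory.redetect({
    "door": [-0.9, 0.04, -0.0125],
    "reappeared": [0.03, -0.48, 0.33],
})
print(while_hidden, after_return)
```

While the object is hidden, no candidate clears the threshold, so nothing is returned. Once a similar enough mask shows up, tracking can resume on it.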
Tech from hell, or heaven?
It’s hard to see if such technology will be positive or negative for humans. To me, it will depend on who has it.
Used by governments, we can all now envision the CyberPunk story from the beginning.
Used by a YouTuber, you can foresee incredible “in real life” videos that portray the lives of celebrities in a unique way. Paired with robotics, it could take manufacturing to a completely new level.
In the meantime, we can’t help but marvel at the crazy developments we are witnessing in AI recently.