FAn, the Model that Tracks Anything
TheTechOasis
Imagine it's 2030.
You're walking your dog late at night. The atmosphere is dead quiet, but you hear a buzzing sound somewhere around you.
You feel weird… as if you were being watched. As if something were following you.
You try to stay out of open spaces, but every time you step into the open, you hear it again. Eventually, you can even see it.
It's a drone, and it's following you. And there's nothing you can do about it.
Source: Author with MidJourney
Seems like yet another cyberpunk story, right?
But, to your surprise, that technology already exists, and you're going to see it for yourself.
The ultimate tracker
Two of the most prestigious universities in the world, MIT and Harvard, have collaborated to create FAn, the "Follow Anything model".
FAn is an open-vocabulary, any-object tracker.
In other words, it's capable of tracking any object you desire, as long as it's in its camera view.
It's also multimodal, meaning you can define the object to track with words, like "a ball", by clicking on the object in the frame, or even with bounding boxes.
That's all you need, and FAn will track it with insane accuracy.
Even in cases where the object temporarily disappears, FAn is capable of redetecting it and continuing to track it.
The applications are as life-changing for some as they are dangerous for others.
But what's the intuition behind it?
The world of feature spaces
One of AI's most important, yet misunderstood, discussions is the idea of AI representations.
You see, to make AI work, data needs to be in numerical format, as computers only understand numbers. For that, we use vector embeddings.
These are numerical representations of the data you give the model, be that text, images, spectrograms, you name it.
For instance, the word "dog" will take a form such as:
dog = [0.02, -0.5, 0.34,...];
To understand the patterns and abstract meaning behind data, AI models create a compressed, n-dimensional numerical space (called a feature space or embedding space in the AI literature) that obeys one principal law:
Similar concepts in real life should have similar vectors in this space. Dissimilar concepts should have different vectors.
For instance:
# Domestic mammals
dog = [0.02, -0.5, 0.34,...];
cat = [0.03, -0.43, 0.32,...];
# Static objects
door = [-0.9, 0.04, -0.0125,...];
This is how machines understand meaning and context. ChatGPT has such representations, and so does MidJourney.
Additionally, one of the most dynamic fields of AI research is making these spaces "multimodal", or capable of placing together similar concepts expressed in different formats:
# Domestic mammals
"a Siberian Husky" = [0.01, -0.55, 0.34,...];
Siberian_husky.img = [0.01, -0.55, 0.34,...];
"a wild cat" = [0.03, -0.34, 0.29,...];
# A dissimilar object
white_door.img = [-0.9, 0.04, -0.0125,...];
"a white door" = [-0.9, 0.04, -0.0125,...];
In case you're wondering: to evaluate how similar two vectors are, we compute their similarity using techniques such as cosine similarity or Euclidean distance.
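As a minimal sketch, using the toy three-dimensional vectors from above (real embeddings have hundreds or thousands of dimensions), cosine similarity can be computed like this:

import numpy as np

def cosine_similarity(a, b):
    # 1 means same direction (very similar), 0 unrelated, -1 opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog = np.array([0.02, -0.5, 0.34])
cat = np.array([0.03, -0.43, 0.32])
door = np.array([-0.9, 0.04, -0.0125])

print(cosine_similarity(dog, cat))   # ~0.999: similar concepts
print(cosine_similarity(dog, door))  # ~-0.08: dissimilar concepts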
Now, FAn builds on these concepts to create the best zero-shot autonomous tracker.
But how does it work?
Modular architectures are here to stay
As I've said time and time again, AI is top of mind today thanks to the concept of foundation models like GPT powering solutions like ChatGPT.
Previously, models were trained for very domain-specific use cases and classes. For instance, FAn would have been built to detect one particular object, and would work decently only in very constrained environments.
Thanks to foundation models, FAn can track any object you want, in any domain, in real time, because these models can generalize to data classes they haven't necessarily seen before (that's the definition of a foundation model).
Incredibly, FAn packs two foundation models in the same solution.
Let's view the complete stack:
Hardware
- A physical tracker (a drone in the experiments) that includes a camera
- A ground station that communicates with the drone, receives the video frames, and outputs the movements the drone has to make
Software
- SAM: Meta's segmentation model, which receives an image and segments it into objects
- DINO/CLIP: Using one or the other, this foundation model extracts the most relevant features of the image
And how does it all fit together?
Similar things stay together
Although bringing all this to fruition required some of the brightest minds from Harvard and MIT, the procedure is actually quite understandable.
For every frame the camera attached to the drone captures, FAn first segments the image using SAM, separating the frame into masks that refer to different objects (similar to the previous image).
Then, for every pixel in the frame, DINO or CLIP (the researchers eventually settled on DINO) calculates a vector embedding. This vector captures the semantic meaning of the pixel.
The model then groups the DINO vectors of all the pixels that fall inside the same SAM-detected mask. This way, every identified object in the image is assigned one unique vector embedding.
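Here's a minimal sketch of that grouping step, assuming we already have DINO's per-pixel features and SAM's masks as NumPy arrays (the shapes and names are illustrative, not FAn's actual code):

def mask_embeddings(features, masks):
    # features: (H, W, D) array, one D-dimensional DINO vector per pixel
    # masks: list of (H, W) boolean arrays, one per object found by SAM
    # Returns one D-dimensional embedding per object: the mean of its pixels
    return [features[mask].mean(axis=0) for mask in masks]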
Next, FAn takes the vector embedding of the query (the text, click, or bounding box the user sends specifying which object to track) and applies cosine similarity between it and the vector embeddings of the different masks in the frame, which we will call mask embeddings.
As the query embedding and the mask embeddings both capture the semantic meaning of their underlying data, the mask whose embedding scores highest against the query embedding is the one assigned to it.
For instance, if the query was "a whale", its query embedding will be assigned to the mask that captured the whale in the image.
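Continuing the sketch, and reusing the cosine_similarity helper from earlier, the matching step is essentially an argmax over similarity scores (again, illustrative code, not FAn's):

import numpy as np

def best_mask(query_emb, mask_embs):
    # Score every object's embedding against the query and pick the winner
    scores = [cosine_similarity(query_emb, emb) for emb in mask_embs]
    return int(np.argmax(scores))  # index of the mask to track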
And all this takes you to videos such as these:
Autonomous drone tracking a car. Source: MIT/Harvard
Additionally, to redetect temporarily hidden objects, the model keeps in memory the embedding of the assigned mask over a parametrizable number of past steps, so tracking resumes the moment a mask with an embedding similar to a stored one "reappears".
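Here's one way that re-detection logic could look, as a sketch (the memory length and similarity threshold below are hypothetical parameters, not FAn's actual values; cosine_similarity is the helper from before):

from collections import deque

MEMORY_STEPS = 30          # hypothetical: how many past embeddings to remember
REDETECT_THRESHOLD = 0.85  # hypothetical: similarity needed to resume tracking

memory = deque(maxlen=MEMORY_STEPS)

def update_and_redetect(tracked_emb, candidate_embs):
    # While the object is visible, keep refreshing the memory buffer
    if tracked_emb is not None:
        memory.append(tracked_emb)
        return None
    # Object lost: scan the current frame's masks for a familiar embedding
    for i, cand in enumerate(candidate_embs):
        if any(cosine_similarity(cand, m) >= REDETECT_THRESHOLD for m in memory):
            return i  # mask index where the object reappeared
    return None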
Tech from hell, or heaven?
It's hard to tell whether such technology will be positive or negative for humans. To me, it will depend on who has it.
Used by governments, we can all now envision the cyberpunk story from the beginning.
Used by a YouTuber, you can foresee incredible "in real life" videos that portray the lives of celebrities in a unique way. Or, with robotics, taking manufacturing to a completely new level.
In the meantime, we can do little but marvel at the crazy developments we are witnessing in AI recently.
Key AI concepts to retain from today's newsletter:
- First-ever open-vocabulary object tracker
- The importance of understanding what embeddings are
- The increasing relevance of multimodal solutions
Top AI news for the week
Forget about SEO, AI engine optimization is the future
[New Section!]: Leaders
I'm very pleased to announce the launch of my first premium service, Leaders. Leaders will offer content that won't be uploaded anywhere else (not even my Medium account), and that isn't clickbaity or trendy just to get clicks.
But what is it?
It's a focus on the present. I tend to indulge quite a bit in the future of AI, but the present is just as important. To that end, Leaders aims to become a beacon of light for you to start leveraging AI now.
But why isn't it free?
I'll be honest with you. Currently, I can't make a living writing for you guys. It is what it is. Thus, I work as a management consultant at PwC to make ends meet: 12-to-15-hour shifts that, luckily for me, are bearable because I work as an AI advisor for big companies.
Consequently, I write this newsletter in the little free time I have left.
However, dedicating my life to learning, educating, and advising about AI is my real passion, and this is my first step in that direction. But I know my premium service isn't cheap, so I get it if you don't (or can't) pay for it.
It's fine, I appreciate you anyway.
That being said, if I may, I'm pretty sure Leaders is worth every penny.
I mean, it's all about relativity. For the price of a sub-par cocktail in any bar in downtown {insert any medium-sized city}, you're getting access to actionable insights regarding the technology that will change society.
Before anyone else.
Anyway, whatever you decide to do, I'm really happy and thankful to have you here, no matter what. And for those of you who stay on the free tier, all previous content will stay free, forever.
That's a promise, and I'm a man of my word, because that's what Leaders do.