FAn, the Model that Tracks Anything

🏝 TheTechOasis 🏝

Imagine it’s 2030.

You’re walking your dog late at night. The atmosphere is dead quiet, but you hear a buzzing sound somewhere around you.

You feel weird… as if you were being watched. As if something were following you.

You try to stay out of open spaces, but every time you step into the open, you hear it again. Eventually, you can even see it.

It’s a drone, and it’s following you. And there’s nothing you can do about it.

Source: Author with MidJourney

Seems like yet another CyberPunk story, right?

But, to your surprise, that technology already exists, and you’re going to see it for yourself.

The ultimate tracker

Two of the most prestigious universities in the world, MIT and Harvard, have collaborated to create FAn, the “Follow Anything model”.

FAn is an open-vocabulary, any-object tracker.

In other words, it’s capable of tracking any object you desire, as long as it’s in its camera view.

It’s also multimodal, meaning that you can define the object to track with words, like “a ball”, by clicking on the object in the frame, or by drawing a bounding box around it.

That’s all you need, and FAn will track it with insane accuracy.

Even in cases where the object temporarily disappears, FAn is capable of redetecting the object and continuing to track it.

The applications are as life-changing for some as they are dangerous for others.

But what’s the intuition behind it?

The world of feature spaces

One of the most important, yet most misunderstood, ideas in AI is that of representations.

You see, to make AI work, data needs to be in numerical format, as computers only understand numbers. For this, we use vector embeddings.

These are numerical representations of the data you give the model, be that text, images, spectrograms, you name it.

For instance, the word ‘dog’ will take a form such as:

dog = [0.02, -0.5, 0.34,...];

In order to understand the patterns and abstract meaning behind data, AI models create this compressed, n-dimensional numerical space (called a feature space or embedding space in AI literature) that has one principal law:

Similar concepts in real life should have similar vectors in this space. Dissimilar concepts should have different vectors.

For instance:

# Domestic mammals
dog  = [0.02, -0.5, 0.34,...];
cat  = [0.03, -0.43, 0.32,...];
# Static objects
door = [-0.9, 0.04, -0.0125,...];

This is how machines understand meaning and context. ChatGPT has such representations, and so does MidJourney.

Additionally, one of the most dynamic fields of AI research is making these spaces ‘multimodal’, or capable of placing similar concepts close together even when they come in different formats:

# Domestic mammals
"a Siberian Husky" = [0.01, -0.55, 0.34,...];
Siberian_husky.img = [0.01, -0.55, 0.34,...];
"a wild cat"       = [0.03, -0.34, 0.29,...];
# A dissimilar object
white_door.img     = [-0.9, 0.04, -0.0125,...];
"a white door"     = [-0.9, 0.04, -0.0125,...];

In case you’re wondering, to evaluate how similar two vectors are, we calculate vector similarity using techniques such as cosine similarity or Euclidean distance.
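To make that concrete, here is a minimal Python sketch of both measures, applied to the toy dog, cat and door vectors from earlier. The vectors are cut down to three dimensions purely for illustration (real embeddings have hundreds or thousands of dimensions), and the function names are my own, not taken from any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy three-dimensional vectors (illustrative values, not real model outputs)
dog = [0.02, -0.5, 0.34]
cat = [0.03, -0.43, 0.32]
door = [-0.9, 0.04, -0.0125]

dog_cat = cosine_similarity(dog, cat)
dog_door = cosine_similarity(dog, door)

print("dog vs cat  (cosine):", round(dog_cat, 3))
print("dog vs door (cosine):", round(dog_door, 3))
print("dog vs cat  (euclidean):", round(euclidean_distance(dog, cat), 3))
print("dog vs door (euclidean):", round(euclidean_distance(dog, door), 3))
```

With these numbers, ‘dog’ and ‘cat’ score a cosine similarity very close to 1, while ‘dog’ and ‘door’ score slightly below zero. The Euclidean distances tell the same story in reverse: the smaller distance is the one between the two domestic mammals.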

Now, FAn takes these concepts and builds them into the best zero-shot autonomous tracker.

But how does it work?

Modular architectures are here to stay

As I’ve said time and time again, AI is top of mind today thanks to the concept of foundation models like GPT powering solutions like ChatGPT.

Previously, models were trained for very domain-specific use cases and classes. For instance, FAn would have been built to detect one particular kind of object, and would have worked decently only in very constrained environments.

Thanks to foundation models, FAn can track any object you want, in any domain and in real time, because these models are capable of generalizing to data classes they haven’t necessarily seen before (that’s what defines a foundation model).

Incredibly, FAn packs two foundation models in the same solution.

Let’s view the complete stack:

Hardware
  • A physical tracker (a drone in the experiments) that includes a camera

  • A ground system that communicates with the drone, receives the video frames, and outputs the movements the drone has to make

Software
  • SAM: Meta’s segmentation model, which receives an image and segments it into objects

  • DINO/CLIP: one or the other is used as the foundation model that extracts the most relevant features of the image

And how does it all fit together?

Similar things stay together

Although bringing all this to fruition required some of the brightest minds from Harvard and MIT, the procedure is actually quite understandable.

  1. For every frame the camera attached to the drone captures, FAn first segments the image using SAM, separating the frame into masks that each correspond to a different object.

  2. Then, for every pixel in the frame, DINO or CLIP (the researchers eventually settled on DINO) calculates a vector embedding. This vector captures the semantic meaning of the pixel.

  3. Next, the model groups the DINO vectors of all the pixels that fall inside the same mask detected by SAM. This way, every identified object in the image is assigned a single vector embedding.

  4. Then, FAn takes the vector embedding of the query (a text, click, or bounding box the user sends specifying what object to track) and computes its cosine similarity with the vector embeddings of the different masks in the image frame, which we will call mask embeddings.

  5. As the query embedding and the mask embeddings both capture the semantic meaning of their underlying data, the mask whose embedding scores highest against the query embedding is selected as the object to track.

For instance, if the query was “a whale”, its query embedding will be matched to the mask that captured the ‘whale’ object in the image.
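The steps above can be sketched in a few lines of Python. Take this as a toy illustration under stated assumptions, not the researchers’ implementation: the per-pixel features and the query embedding are made-up numbers standing in for DINO/CLIP outputs, the masks stand in for SAM’s output, and I’m assuming the pixel vectors inside a mask are combined by simple averaging. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mask_embeddings(pixel_features, masks):
    """Step 3: one embedding per mask, here the average of its pixels' vectors (an assumption)."""
    return {
        mask_id: mean_vector([pixel_features[p] for p in pixels])
        for mask_id, pixels in masks.items()
    }


def select_mask(query_embedding, embeddings):
    """Steps 4 and 5: score every mask against the query and keep the best one."""
    scores = {
        mask_id: cosine_similarity(query_embedding, emb)
        for mask_id, emb in embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Step 2 stand-in: made-up per-pixel features, keyed by (row, column)
pixel_features = {
    (0, 0): [0.9, 0.1, 0.0],
    (0, 1): [0.8, 0.2, 0.1],
    (1, 0): [0.1, 0.9, 0.2],
    (1, 1): [0.0, 0.8, 0.3],
}

# Step 1 stand-in: two masks, each listing the pixels it covers
masks = {0: [(0, 0), (0, 1)], 1: [(1, 0), (1, 1)]}

# Made-up embedding for the query "a whale"
query = [0.85, 0.15, 0.05]

embeddings = mask_embeddings(pixel_features, masks)
best, scores = select_mask(query, embeddings)
print("selected mask:", best)
```

Here mask 0 plays the role of the whale: its averaged embedding points in almost exactly the same direction as the query, so it wins over mask 1.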

And all this takes you to videos such as these:

Autonomous drone tracking a car. Source: MIT/Harvard

Additionally, to redetect temporarily hidden objects, the model keeps in memory the representation of the assigned mask over a configurable number of steps, so that it can resume tracking the moment a mask with an embedding similar to the stored one ‘reappears’.
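A rough way to picture that memory is a fixed-size buffer of recent target embeddings that new masks are compared against. The sketch below is my own simplification: the class name, the buffer size and the 0.9 similarity threshold are all hypothetical, and the actual system may store and match embeddings differently.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TargetMemory:
    """Keeps the last `max_steps` embeddings of the tracked object (hypothetical design)."""

    def __init__(self, max_steps=5, threshold=0.9):
        # deque(maxlen=...) silently drops the oldest entry once it is full
        self.embeddings = deque(maxlen=max_steps)
        self.threshold = threshold

    def remember(self, embedding):
        self.embeddings.append(list(embedding))

    def redetect(self, candidates):
        """Return the id of the mask most similar to any stored embedding, or None."""
        best_id, best_score = None, self.threshold
        for mask_id, emb in candidates.items():
            for stored in self.embeddings:
                score = cosine_similarity(stored, emb)
                if score >= best_score:
                    best_id, best_score = mask_id, score
        return best_id


memory = TargetMemory(max_steps=2)
for emb in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]):
    memory.remember(emb)  # only the two most recent embeddings are kept

# Frame where the target is hidden: nothing resembles the stored embeddings
hidden = memory.redetect({0: [0.0, 1.0, 0.0]})

# Frame where the target is back: mask 1 looks like what we remembered
found = memory.redetect({0: [0.0, 1.0, 0.0], 1: [0.85, 0.15, 0.0]})

print("while hidden:", hidden)
print("after reappearing:", found)
```

While the object is hidden, no mask clears the threshold, so nothing is returned; once a similar-looking mask shows up again, tracking can pick up from that mask.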

Tech from hell, or heaven?

It’s hard to tell whether such technology will be positive or negative for humans. To me, it will depend on who has it.

In the hands of governments, we can all now envision the CyberPunk story from the beginning.

In the hands of a YouTuber, you can foresee incredible “in real life” videos that portray the lives of celebrities in a unique way. Or, combined with robotics, it could take manufacturing to a completely new level.

In the meantime, we can only marvel at the crazy developments we are witnessing in AI recently.

Key AI concepts to retain from today’s newsletter:

- First-ever open-vocabulary object tracker

- The importance of understanding what embeddings are

- The increasing relevance of multimodal solutions

👾Top AI news for the week👾

🔮 Forget about SEO, AI engine optimization is the future

👩🏽‍🎓 [New Section!]: Leaders 🧑🏻‍🎓

I’m very pleased to announce the launch of my first premium service, Leaders. Leaders will offer content that won’t be uploaded anywhere else (not even my Medium account), and that isn’t clickbaity or trendy just to get clicks.

But what is it?

It’s a focus on the present. I tend to indulge quite a bit in the future of AI, but the present is just as important. For that, Leaders aims to become a beacon of light for you to start leveraging AI now.

But why isn’t it free?

I’ll be honest with you. Currently, I can’t make a living writing for you guys. It is what it is. Thus, I work as a management consultant at PwC to make ends meet: 12-15 hour shifts that, luckily for me, are bearable because I work as an AI advisor for big companies.

Consequently, I write this newsletter in the little free time I have left.

However, dedicating my life to learning, educating, and advising about AI is my real passion, and this is my first step in that direction. But I know my premium service isn’t cheap, so I get it if you don’t (or can’t) pay for it.

It’s fine, I appreciate you anyways.

That being said, if I may, I’m pretty sure Leaders is worth every penny.

I mean, it’s all about relativity. For the price of a sub-par cocktail in any bar in downtown {insert any medium-sized city}, you’re getting access to actionable insights regarding the technology that will change society.

Before anyone else.

Anyways, whatever you decide to do, I’m really happy and thankful to have you here, no matter what, and for those of you who stay in the free tier, all previous content will stay free, forever.

That’s a promise, and I’m a man of my word, because that’s what Leaders do.

Ignacio ‘Nacho’ de Gregorio
