Figure, SIMA, & Devin, What a Week in AI!

Sponsored by

šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Figure and Google Jump into the Robot Craze

  • Leaders: After Devin, the World Might Never be the Same Again

💎 This Week’s Sponsor 💎

To be a Leader in the future, having AI skills is going to be a must.

And few global institutions require less of an introduction than MIT and its short course on AI & Business Strategy.

What’s more, you can get a 20% discount by using code IWD2024 at checkout. Turn insights into educated action with MIT today.

Artificial Intelligence online short course from MIT

Study artificial intelligence and gain the knowledge to support its integration into your organization. If you're looking to gain a competitive edge in today's business world, then this artificial intelligence online course may be the perfect option for you.

  • Key AI management and leadership insights to support informed, strategic decision making.

  • A practical grounding in AI and its business applications, helping you to transform your organization into a future-forward business.

  • A road map for the strategic implementation of AI technologies in a business context.

🤩 AI Research of the Week 🤩

This week, we have witnessed a great leap on three fronts: Figure AI’s talking robot, Google’s generalist agent SIMA, and Devin, the AI software engineer from Cognition Labs; the latter takes up this week’s entire Leaders segment.

Thus, in today’s research of the week, we are focusing on Figure and SIMA.

They are far from human level, meaning that lavish claims like ‘this is AGI’ are unsubstantiated. But with AI it’s never just about what these systems can do now; it’s about what they will become, and in my view we are seeing the first baby steps of embodied AI intelligence.

Embodying GPT-4

So, what is Figure AI?

Figure AI is a robotics company that is building robots to, in the words of the CEO, “eliminate the need for unsafe and undesirable jobs, allowing future generations to live happier, more purposeful lives.”

You’ve probably heard of similar companies, but when OpenAI, NVIDIA, Jeff Bezos, and Intel all invest in a $675 million funding round at a $2.6 billion valuation for a company with no products on the market, you know they are on to something.

But why is everyone talking about this company? Well, it’s because of this video.

I highly recommend you click on that link, but long story short, it’s a robot that interacts with a human and, whilst doing so, performs several actions with great dexterity.

At the core of Figure AI’s robot sits none other than GPT-4V, OpenAI’s flagship Multimodal Large Language Model, or MLLM.

In other words, Figure AI’s robot is the first time we have seen an ‘embodied ChatGPT’, meaning that LLMs are now capable of taking embodied actions too.

However, although it may have gone over most people’s heads, I want you to pay attention to the moment when the human asks the robot to act while it is still explaining the reasoning behind its previous action.

That was not a trivial request; it was done on purpose to showcase the model’s capacity to multitask.

Specifically, it seems they have fine-tuned GPT-4 to output both text and action representations, the former decoded into speech through a vocoder and the latter decoded into actuator motion to move the body.

While we don’t know much about the underlying mechanisms of Figure AI’s robots, we can get a pretty good idea from examples like DeepMind’s RT-2 model, where researchers have already shown how to train an LLM to output either robot actions or text.

Now, RT-2 was trained using PaLM-E and PaLI-X, two models that are nowhere close to GPT-4’s capabilities.

Thus, it could well be the case that the model running Figure’s robots is the most advanced vision-language-action model to ever exist.
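To make that idea more tangible, here is a minimal sketch of the RT-2-style recipe: an LLM whose vocabulary is extended with discrete action tokens, which at inference time are split from the text tokens and routed either to a speech decoder or to the actuators. Every name and number below (the interfaces, the vocabulary sizes) is a hypothetical stand-in, not Figure’s or DeepMind’s actual code.

```python
# Hypothetical sketch of an RT-2-style vision-language-action loop.
# All object names (mllm, vocoder, robot) are assumed interfaces, not real APIs.

TEXT_VOCAB_SIZE = 32_000   # ordinary language tokens
NUM_ACTION_BINS = 256      # extra tokens appended to the vocabulary, each one
                           # binning a discretized actuator/joint target

def is_action_token(token_id: int) -> bool:
    # Action tokens live past the end of the normal text vocabulary.
    return token_id >= TEXT_VOCAB_SIZE

def control_step(mllm, vocoder, robot, camera_frame, user_utterance):
    """One perception-to-action step: the model sees the scene and the request,
    then emits a mix of text tokens (spoken reply) and action tokens (motion)."""
    token_ids = mllm.generate(images=[camera_frame], prompt=user_utterance)

    speech_tokens = [t for t in token_ids if not is_action_token(t)]
    action_tokens = [t for t in token_ids if is_action_token(t)]

    # Text tokens are decoded to audio; action tokens are mapped back to
    # discretized joint targets and sent to the actuators.
    robot.play_audio(vocoder.synthesize(mllm.decode(speech_tokens)))
    robot.send_joint_targets([t - TEXT_VOCAB_SIZE for t in action_tokens])
```

The key design choice is that speaking and moving share a single autoregressive output stream, which is what lets one model do both at once.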

But let’s be real.

That was a highly-constrained demo that could have been rehearsed multiple times. We can almost guarantee that the robot is still in its infancy in terms of general-purpose capabilities.

Therefore, to envision the next step in the journey for these robots, we can take inspiration from what Google has just released: SIMA, a generalist agent for 3D environments.

SIMA, The First True Generalist Agent?

Google’s Scalable Instructable Multiworld Agent (SIMA) project aims to create artificial intelligence (AI) systems that can understand and execute arbitrary language instructions in any 3D environment.

This initiative addresses a significant challenge in general AI development by grounding language in perception and embodied actions across diverse virtual worlds, including research environments and commercial video games.

In layman’s terms, SIMA takes in language requests and turns them into keyboard-and-mouse actions in 3D environments. But here’s the key element to take into consideration:

The goal is for these agents to accomplish tasks as humans would, using natural language instructions given by a user, like “gather resources and build a house”.

In other words, the model has exactly the same inputs as we do when interacting with those environments: it has no access to underlying APIs or anything of the sort, so the only way for it to accomplish the requested tasks is by predicting the keyboard and mouse actions a human would perform to achieve the same goal.

The model is composed of the following components:

  1. Encoders:

    • A text encoder translates language instructions into embeddings the model can interpret.

    • An image encoder based on the recently released SPARC.

    • A video encoder based on Phenaki.

  2. Multi-modal Transformer + Transformer XL: two transformer architectures; the former performs cross-attention between the modalities, while the latter attends over previous states to produce the new state.

  3. The policy: a classification head that decides which keyboard-and-mouse action to execute (a rough sketch of how these pieces fit together follows after this list).
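As a quick visual aid, here is a rough, assumed sketch (in PyTorch) of how those three components could be wired together. The modules below are simplified stand-ins for SIMA’s real encoders (SPARC, Phenaki) and its Transformer-XL memory, purely to show the data flow from an instruction and a stream of frames to an action.

```python
import torch
import torch.nn as nn

class SimaLikeAgent(nn.Module):
    """Hypothetical sketch of the component wiring described above.
    The real SIMA uses SPARC/Phenaki encoders and a Transformer-XL memory;
    here they are stand-in modules just to show the data flow."""

    def __init__(self, d_model=512, num_actions=128):
        super().__init__()
        self.text_encoder = nn.Embedding(32_000, d_model)         # stand-in text encoder
        self.image_encoder = nn.Linear(3 * 224 * 224, d_model)    # stand-in for SPARC
        self.video_encoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for Phenaki
        self.cross_attention = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.memory = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)  # stand-in for Transformer-XL
        self.policy_head = nn.Linear(d_model, num_actions)        # keyboard/mouse classes

    def forward(self, instruction_ids, frames):
        text = self.text_encoder(instruction_ids)                 # (B, T_text, D)
        imgs = self.image_encoder(frames.flatten(2))              # (B, T_frames, D)
        video, _ = self.video_encoder(imgs)                       # temporal context over frames
        # The language instruction attends to what the agent currently sees.
        fused, _ = self.cross_attention(text, video, video)
        state = self.memory(fused)                                # carry past states forward
        return self.policy_head(state[:, -1])                     # logits over actions
```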

There’s a lot to unpack here, so let’s go step-by-step.

Processing the world

As in most frontier models today, the first step is to “encode” the inputs.

In layman’s terms, the idea is to take the input data (text and video in our case) and turn these data points into vector embeddings (sequences of numbers) using their respective encoders.

But why would you want to do this?

Well, by performing this transformation, each element is represented in a dense vector that contains the semantic meaning of the concept.

In other words, similar concepts will have similar vectors, meaning they will sit closer together when represented in the vector space.

Also, as the concepts are now expressed in numerical form, the model can compute the similarity between these vectors (how close they are to each other) to determine which concepts are related.

This way, the model will know that the words ‘dog’ and ‘cat’ refer to similar concepts.
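A toy example makes this concrete. The three vectors below are invented purely for illustration (real embeddings have hundreds or thousands of dimensions), but the idea is exactly this: related words end up with a high cosine similarity, unrelated ones with a low one.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means "pointing the same way", near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings, invented for illustration only.
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(dog, cat))  # high: similar concepts sit close together
print(cosine_similarity(dog, car))  # low: unrelated concepts sit far apart
```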

In multimodal situations like the one we are discussing, where the vectors come from different data types, similarity also plays a crucial role. The underlying idea is that you want the word ‘dog’ and an image of a dog to share similar vectors, signaling to the model that they refer to the same concept.

But as text and images are very different data types structure-wise, you are going to require different encoding systems.

This is an issue, because you have to train separate encoders while ensuring they result in similar embeddings for similar concepts.

To solve this, SIMA uses a SPARC image encoder.

A very recent breakthrough, SPARC encoders are trained in a similar way to almost every other image encoder (using contrastive learning) but are much better at capturing fine-grained details.

What is contrastive learning?

A very common training method for models that work with both text and images. It works by pulling similar concepts closer together while pushing dissimilar concepts apart.

By looking at millions of images and their text descriptions, the model learns to capture what an image depicts.

The issue is that most common image encoders fail to capture the local details of the image.

Yes, they’ll tell you what the image is about, but they will miss smaller details that don’t contribute to the global semantics of the image yet are nevertheless important in many cases.
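For reference, the standard contrastive objective used in the CLIP family looks roughly like the sketch below; SPARC’s contribution is to add a fine-grained, patch-level term on top of something like this.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """CLIP-style contrastive loss over a batch of matching image/text pairs.
    Row i of each tensor is assumed to describe the same concept; the loss
    pulls matching pairs together and pushes all other pairings apart."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    logits = image_embeds @ text_embeds.T / temperature   # pairwise similarities
    targets = torch.arange(len(logits))                   # image i matches caption i

    loss_i2t = F.cross_entropy(logits, targets)           # image -> correct caption
    loss_t2i = F.cross_entropy(logits.T, targets)         # caption -> correct image
    return (loss_i2t + loss_t2i) / 2
```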

SPARC proposes a similar method, but adds a very interesting twist.

As an example, let’s say we have an image-text pair that represents “cat and dog in a basket”.

  • First, SPARC breaks the image into patches.

  • Then, for each patch, it assigns one of the tokens in the text description to it. If the patch covers a part of the dog’s body, the patch is assigned the word “dog”.

  • This is done for every patch, and for those patches that cover several aspects of the text description, it assigns a weighted value to each.

For instance, if a patch covers the body of the cat while the basket is also slightly visible, the model mainly assigns the word “cat” to that patch but also the word “basket”, the latter with less relative weight.

  • Consequently, for each text token, all the image patches assigned to that token are grouped.

In the above example, SPARC takes the weight assigned to the word ‘dog’ in each patch, pools those weighted patches into a single grouped embedding, and then compares it with the actual text token embedding.

In layman’s terms, the key difference between SPARC and other image encoders is that it assigns individual words in the text description to specific parts of the image.

This way, if the grouped embedding of certain parts of the image is heavily skewed toward the word token ‘dog’, that area of the image most probably includes a dog.

That way, the model not only learns that the image showcases a dog and a cat in a basket, but also learns where each element, like ‘dog’, ‘cat’, and ‘basket’, is situated.
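Here is an illustrative sketch of that grouping step. The thresholding and pooling details are simplified assumptions rather than SPARC’s exact recipe, but the core idea (sparsely assign caption tokens to patches, pool the patches per token, compare with the token embedding) is what is described above.

```python
import torch
import torch.nn.functional as F

def sparse_group_patches(patch_embeds, token_embeds, threshold=0.1):
    """Illustrative sketch of SPARC-style grouping: give each image patch a
    (sparse) weight over the caption tokens, then build one language-grouped
    image embedding per token and compare it with that token's embedding."""
    # Similarity of every patch to every caption token: (num_patches, num_tokens)
    sims = F.normalize(patch_embeds, dim=-1) @ F.normalize(token_embeds, dim=-1).T

    # Per-patch weights over tokens; small weights are zeroed to keep it sparse.
    weights = sims.softmax(dim=-1)
    weights = torch.where(weights > threshold, weights, torch.zeros_like(weights))

    # For each token, pool the patches assigned to it: (num_tokens, embed_dim)
    grouped = weights.T @ patch_embeds
    grouped = grouped / (weights.sum(dim=0, keepdim=True).T + 1e-6)

    # Fine-grained alignment: grouped patch embedding vs. its token embedding.
    alignment = F.cosine_similarity(grouped, token_embeds, dim=-1)
    return alignment   # one score per caption token ("dog", "cat", "basket", ...)
```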

This was a crucial element for SIMA, as the 3D agent has to be able to identify specific objects in its surroundings and interact with them as part of the requested tasks.

Moving on to the video encoder, it is included so that the model can account for past states. Video encoders provide temporal awareness, something text or image encoders can’t offer.

This is important because the next action to take will depend not only on the current state of the environment but also on the past states of the environment and actions taken.

For instance, the best action to take next might be lighting up a match, but it’s probably not the best idea if the previous action was to cover the floor in gasoline.
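A trivial sketch of what “accounting for past states” can mean in practice: the policy conditions on a rolling window of recent frames and actions rather than on a single screenshot. The class below is a made-up illustration, not SIMA’s actual memory mechanism (which, as listed above, is a Transformer-XL-style module).

```python
from collections import deque

class RollingContext:
    """Keeps the last `horizon` (frame, action) pairs so the policy can condition
    on what just happened, not only on the current screenshot."""

    def __init__(self, horizon=8):
        self.buffer = deque(maxlen=horizon)

    def push(self, frame, last_action):
        self.buffer.append((frame, last_action))

    def as_input(self):
        # The video encoder consumes the whole window, giving the model a memory
        # of recent events (e.g. "the floor was just covered in gasoline").
        return list(self.buffer)
```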

Choosing the best policy

With the provided information, SIMA then uses a set of transformer models (just like ChatGPT would) to take the representations generated by the different encoders and, instead of predicting the next word as an LLM would, output the policy that dictates the keyboard and mouse actions to be executed by the agent.
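To make that last step concrete, here is a minimal, assumed sketch of a policy expressed as a classification head: the fused state coming out of the transformer stack is mapped to logits over a fixed action vocabulary, and the sampled class is what gets translated into actual keyboard and mouse events. The action list and names below are invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical discrete action vocabulary (the real SIMA action space is richer).
ACTIONS = ["move_forward", "move_back", "turn_left", "turn_right",
           "mouse_click", "jump", "open_inventory", "no_op"]

class PolicyHead(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.classifier = nn.Linear(d_model, len(ACTIONS))

    def forward(self, state):
        # state: (batch, d_model) summary produced by the transformer stack
        return self.classifier(state)            # logits over the action classes

def act(policy_head, state):
    logits = policy_head(state)
    action_id = torch.distributions.Categorical(logits=logits).sample().item()
    return ACTIONS[action_id]                    # e.g. "turn_left" -> key press

# Usage: state = torch.randn(1, 512); print(act(PolicyHead(), state))
```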

On a side note, you may be wondering why they used such an unusual set of Transformers as the main ‘brain’ of the model instead of simply using Gemini, Google’s MLLM.

The reason was probably budget, as the researchers themselves acknowledge in the technical report that the obvious next step for SIMA is to use Gemini instead.

Despite not using the best ‘brain’ available, they obtained the very interesting results we are about to see.

A True Generalizer

As you may imagine by now, the objective all along was to train an agent that wouldn’t have to be the best in every game, but would be sufficiently good at any game it played.

After training, the SIMA agent was capable of performing up to 600 different basic tasks, grouped into different categories such as navigation, animal interaction, food, and so on.

You can check SIMA performing some of these actions here. Moreover, SIMA yielded very promising results worth mentioning.

For starters, despite being trained on many different games in parallel, on average SIMA performed better than agents specialized in playing one single game… in that specific game.

More impressively, across several different games, the agent achieved non-trivial performance on zero-shot tasks in most of them, while again beating the specialized agents.

This meant that even when put in a previously unseen environment, the model performed well, and in some instances, like Goat Simulator 3, it outperformed the specialized agent (the agent trained only to play that game).

But what does this all mean?

Simply put, we are observing tangible evidence of knowledge transfer among games.

In other words, the model learns meaningful skills from some games that can be applied effectively to others (like learning to move with the keyboard).

What’s more, these skills are of such high quality that the generalist agent beats the specialized ones in many games, signaling that the generalist approach helps the model learn superior skills that transfer across environments.

Very impressive results overall, which gain even more importance when put into perspective alongside Figure AI’s developments.

A Great Week for Robotics

This has been a great week for robotics AI in general.

  • On one side, Figure AI proves we are getting better at building humanoids that can perform an increasing range of manual tasks.

  • On the other hand, SIMA showcases that we are starting to see our first generalist agents across 3D environments.

But the real takeaway is the potential for synergy between the two.

We might not be there yet in terms of taking these agents into real-life situations, but the convergence of these two fields is the next natural step: SIMA as the training ground, Figure AI robots as the embodiment of the generalist agents.

And with other companies like Covariant launching their own takes on embodied intelligence (link below), it’s clear that many incumbents feel that humanity’s technology is ripe enough to take on the next big challenge: deploying AI into real life.

👾 Best news of the week 👾

⚡️ The 100 most used GenAI apps according to a16z

⚡️ xAI to open-source Grok, according to Elon Musk

⚡️ Covariant launches RFM-1, the LLM for robots

🥇 This week on Leaders… 🥇

We will discuss the emergence of Devin, the first-ever fully autonomous AI software engineer, and probably the biggest news since the release of the original ChatGPT back in November 2022 (at least in terms of reactions).

However, we are going beyond the hype and the embarrassing ‘this is AGI’ claims to focus on the real impact it’s going to have, and to answer the impending question: ‘So… what now?’

By understanding how they built such an amazing ‘thing’, and reflecting on what such an event implies for the future of human labor, we will uncover the key points you should expect to have an immediate impact on your life.

You will also see high-signal insights from the likes of Apple, Microsoft, Deepseek, and many more.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]