Figure, SIMA, & Devin, What a Week in AI!
TheTechOasis
Breaking down the most advanced AI systems in the world to prepare you for the future.
5-minute weekly reads.
TLDR:
AI Research of the Week: Figure and Google Jump into the Robot Craze
Leaders: After Devin, the World Might Never be the Same Again
This Week's Sponsor
To be a Leader in the future, having AI skills is going to be a must.
And few global institutions require less of an introduction than MIT and its short course on AI & Business Strategy.
What's more, you can get a 20% discount by using code IWD2024 at checkout. Turn insights into educated action with MIT today.
Artificial Intelligence online short course from MIT
Study artificial intelligence and gain the knowledge to support its integration into your organization. If you're looking to gain a competitive edge in today's business world, then this artificial intelligence online course may be the perfect option for you.
On completion of the MIT Artificial Intelligence: Implications for Business Strategy online short course, you'll gain:
Key AI management and leadership insights to support informed, strategic decision making.
A practical grounding in AI and its business applications, helping you to transform your organization into a future-forward business.
A road map for the strategic implementation of AI technologies in a business context.
AI Research of the Week
This week, we witnessed a great leap in robotics on three fronts: Figure AI's talking robot, Google's generalist agent SIMA, and Devin, the AI software engineer from Cognition Labs; the last of these takes up this week's entire Leaders segment.
Thus, in today's research of the week, we are focusing on Figure and SIMA.
They are far from human level, meaning that lavish claims like "this is AGI" are unsubstantiated. But with AI, it's never only about what these systems can do now; it's about what they will become, and in my view we are seeing the first baby steps of embodied AI intelligence.
Embodying GPT-4
So, what is Figure AI?
Figure AI is a robotics company that is building robots to, in the words of the CEO, "eliminate the need for unsafe and undesirable jobs, allowing future generations to live happier, more purposeful lives."
You've probably heard of similar companies, but when OpenAI, NVIDIA, Jeff Bezos, and Intel all invest in a $675 million funding round at a $2.6 billion valuation for a company with no products on the market, you know they are on to something.
But why is everyone talking about this company? Well, it's because of this video.
I highly recommend you click on that link, but long story short, it's a robot that interacts with a human and, while doing so, performs several actions with great dexterity.
At the core of Figure AI's robot sits none other than GPT-4V, OpenAI's flagship Multimodal Large Language Model, or MLLM.
In other words, Figure AI's robot is the first time we have seen an "embodied ChatGPT," meaning that LLMs are now capable of taking embodied actions too.
However, although it may have gone over most people's heads, I want you to pay attention to the moment when the human asks the robot to act while it is still explaining the reasoning behind its previous action.
That was not a trivial request; it was done on purpose to showcase the model's ability to multitask.
Specifically, it seems they have fine-tuned GPT-4 to output both text and action representations, the former decoded into speech through a vocoder and the latter decoded into actuator commands that move the body.
While we don't know much about the underlying mechanisms of Figure AI's robots, we can get a pretty good idea from examples like DeepMind's RT-2 model, where researchers have already shown how to train an LLM to output either robot actions or text.
Now, RT-2 was trained using PaLM-E and PaLI-X, two models that are nowhere near GPT-4's capabilities.
Thus, it could be the case that the model running Figure's robots is the most advanced vision-language-action model ever built.
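To make that idea concrete, here is a minimal sketch of the RT-2-style recipe: the language model's vocabulary is extended with discretized "action tokens," so the same autoregressive decoder can emit either text (later turned into speech) or robot commands (later sent to the actuators). The vocabulary sizes, bin counts, and function below are illustrative assumptions, not Figure AI's or DeepMind's actual implementation.

```python
# Sketch: a decoder whose vocabulary mixes ordinary text tokens with
# discretized action tokens, so one model can both "speak" and "move".
TEXT_VOCAB_SIZE = 32_000          # ordinary language tokens (assumed)
ACTION_BINS_PER_JOINT = 256       # each joint command quantized into 256 bins (assumed)
NUM_JOINTS = 7

# Action tokens occupy the IDs right after the text vocabulary.
ACTION_TOKEN_OFFSET = TEXT_VOCAB_SIZE


def decode_step(token_id: int) -> dict:
    """Route a predicted token either to the speech pipeline or to the actuators."""
    if token_id < ACTION_TOKEN_OFFSET:
        # Ordinary text token -> accumulate and later pass to a vocoder / TTS.
        return {"type": "text", "id": token_id}
    # Action token -> recover which joint and which bin it encodes.
    flat = token_id - ACTION_TOKEN_OFFSET
    joint = flat // ACTION_BINS_PER_JOINT
    bin_idx = flat % ACTION_BINS_PER_JOINT
    # De-quantize the bin back to a continuous command in [-1, 1].
    command = -1.0 + 2.0 * bin_idx / (ACTION_BINS_PER_JOINT - 1)
    return {"type": "action", "joint": joint, "command": command}


# Example: a model output mixing speech tokens and a motion token.
predicted = [17, 942, ACTION_TOKEN_OFFSET + 3 * ACTION_BINS_PER_JOINT + 200]
for tok in predicted:
    print(decode_step(tok))
```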
But let's be real.
That was a highly-constrained demo that could have been rehearsed multiple times. We can almost guarantee that the robot is still in its infancy in terms of general-purpose capabilities.
Therefore, to envision the next step in these robots' journey, we can take inspiration from what Google has just released: SIMA, a generalist 3D agent.
SIMA, the First True Generalist Agent?
Google's Scalable Instructable Multiworld Agent (SIMA) project aims to create artificial intelligence (AI) systems that can understand and execute arbitrary language instructions in any 3D environment.
This initiative addresses a significant challenge in general AI development by grounding language in perception and embodied actions across diverse virtual worlds, including research environments and commercial video games.
In layman's terms, SIMA takes in language requests and turns them into keyboard-and-mouse actions in 3D environments. But here's the key element to keep in mind:
The goal is for these agents to accomplish tasks as humans would, following natural-language instructions given by a user, like "gather resources and build a house".
In other words, the model receives exactly the same inputs we do when interacting with those environments. It has no access to underlying APIs or anything of the sort, so the only way for it to accomplish the requested tasks is to predict the keyboard and mouse actions a human would perform to achieve them.
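To make that constraint concrete, here's a minimal sketch of what such a human-parity interface could look like: the agent only ever sees pixels plus an instruction, and can only answer with keyboard and mouse events. The type names and fields are made up for illustration; they are not SIMA's actual code.

```python
# Sketch of the agent's inputs and outputs: screenshots + language in,
# keyboard/mouse events out, with no access to the game's internals.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    frame: np.ndarray        # H x W x 3 screenshot, exactly what a player sees
    instruction: str         # e.g. "gather resources and build a house"


@dataclass
class Action:
    keys_down: list[str]     # e.g. ["w", "shift"]
    mouse_dx: float          # horizontal mouse movement
    mouse_dy: float          # vertical mouse movement
    left_click: bool


def agent_step(obs: Observation) -> Action:
    """Placeholder policy: a trained SIMA-style model would go here."""
    return Action(keys_down=["w"], mouse_dx=0.0, mouse_dy=0.0, left_click=False)
```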
The model is composed of the following components:
Encoders: text, image (SPARC), and video encoders that turn the raw inputs into vector embeddings.
Multi-modal Transformer + Transformer XL: two transformer architectures; the former performs cross-attention between modalities, and the latter attends to previous states to determine the new state.
The policy: a classification head that decides which keyboard-and-mouse action to execute.
There's a lot to unpack here, so let's go step-by-step.
Processing the world
As in most frontier models today, the first step is to "encode" the inputs.
In layman's terms, the idea is to take the input data (text and video in our case) and turn these data points into vector embeddings (sequences of numbers) using their respective encoders.
But why would you want to do this?
Well, by performing this transformation, each element is represented in a dense vector that contains the semantic meaning of the concept.
In other words, similar concepts will have similar vectors, meaning that they will be closer in the vector space when represented:
Also, as the concepts are now expressed in numerical form, the model can compute the similarity among these vectors (the distance between them) to know what concepts are similar to others.
This way, the model will know that the words "dog" and "cat" refer to similar concepts.
In multimodal situations like the one we are discussing, where the vectors come from different data types, similarity also plays a crucial role. The underlying idea is that you want the word "dog" and an image of a dog to share similar vectors, indicating to the model that they refer to the same concept.
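As a toy illustration of that shared embedding space (the vectors below are hand-made stand-ins, not the output of any real encoder), cosine similarity scores the word "dog" and a photo of a dog as close, and "dog" versus "car" as far apart:

```python
# Toy example: nearby vectors = related concepts, far vectors = unrelated.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


text_dog = np.array([0.9, 0.1, 0.0])    # stand-in embedding of the word "dog"
image_dog = np.array([0.8, 0.2, 0.1])   # stand-in embedding of a photo of a dog
text_car = np.array([0.0, 0.2, 0.9])    # stand-in embedding of the word "car"

print(cosine_similarity(text_dog, image_dog))  # high: same concept, different modality
print(cosine_similarity(text_dog, text_car))   # low: unrelated concepts
```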
But as text and images are very different data types structure-wise, you are going to require different encoding systems.
This is an issue because you have to train separate encoders while ensuring they produce similar embeddings for similar concepts.
To solve this, SIMA uses a SPARC image encoder.
A very recent breakthrough, SPARC encoders are trained in much the same way as most other image encoders (using contrastive learning) but are much better at capturing fine-grained details.
What is contrastive learning?
A very common training method for models that work with both text and images. It works by pulling similar concepts closer together while pushing dissimilar concepts apart.
By looking at millions of images and their text descriptions, the model learns to capture what an image depicts.
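For intuition, here is a minimal sketch of a CLIP-style contrastive objective over a batch of image-text pairs: matching pairs (the diagonal of the similarity matrix) are pulled together while mismatched pairs in the same batch are pushed apart. The shapes, temperature, and random embeddings are illustrative assumptions, not SPARC's exact training setup.

```python
# Sketch of a contrastive (CLIP-style) loss over paired image/text embeddings.
import numpy as np


def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings of matching pairs."""
    logits = img_emb @ txt_emb.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))                 # the diagonal holds the true pairs

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()    # pull the diagonal up

    # Cross-entropy in both directions (image -> text and text -> image).
    return 0.5 * (xent(logits) + xent(logits.T))


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(contrastive_loss(img, txt))
```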
The issue is that most common image encoders fail to capture the local details of the image.
Yes, they'll tell you what the image is about, but they'll miss smaller details that don't help describe the image's global semantics yet are nevertheless important in many cases.
SPARC proposes a similar method, but adds a very interesting twist.
As an example, let's say we have an image-text pair that represents "cat and dog in a basket".
First, SPARC breaks the image into patches.
Then, for each patch, it assigns one of the tokens in the text description to it. If the patch covers a part of the dog's body, the patch will be assigned the word "dog".
This is done for every patch, and in those patches that cover several aspects of the text description, it assigns a weighted value to each.
For instance, if a patch covers the cat's body while the basket is slightly visible, the model mainly assigns the word "cat" to that patch but also the word "basket", the latter with a smaller relative weight.
Consequently, for each text token, all the image patches that are assigned to that token are grouped.
In the example above, the weights assigned to the word "dog" across its patches are used to build that group's combined embedding, which is then compared with the actual text token embedding.
In layman's terms, the key difference between SPARC and other image encoders is that it assigns individual words in the text description to specific parts of the image.
This way, if the grouped embedding of a certain region of the image is heavily skewed toward the token "dog", that area of the image most probably contains a dog.
As a result, the model not only learns that the image shows a dog and a cat in a basket, but also learns where each element ("dog", "cat", and "basket") is located.
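Here is a rough sketch of that grouping step, under simplifying assumptions (random embeddings, dot-product alignment scores, and a hard threshold; the paper's exact recipe differs): each token's weights over the image patches are sparsified and normalized, and the resulting grouped patch embedding is compared against that token's own embedding.

```python
# Sketch of fine-grained token-to-patch grouping in the spirit of SPARC.
import numpy as np


def grouped_patch_embeddings(patch_emb: np.ndarray, token_emb: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """patch_emb: (num_patches, dim), token_emb: (num_tokens, dim) -> (num_tokens, dim)."""
    sims = token_emb @ patch_emb.T                   # (tokens, patches) alignment scores
    sims = np.where(sims > threshold, sims, 0.0)     # sparsify: drop weakly related patches
    weights = sims / np.clip(sims.sum(axis=1, keepdims=True), 1e-8, None)
    return weights @ patch_emb                       # weighted sum of patches per token


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))   # e.g. a 4x4 grid of patch embeddings
tokens = rng.normal(size=(6, 32))     # embeddings for "cat and dog in a basket"
grouped = grouped_patch_embeddings(patches, tokens)

# Each grouped embedding is compared against its own token embedding, so the
# model learns *where* "dog", "cat", and "basket" sit in the image, not just
# that they are present.
per_token_alignment = (grouped * tokens).sum(axis=1)
print(per_token_alignment.shape)      # (6,)
```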
This was a crucial element for SIMA, as the 3D agent has to be able to identify specific objects in its surroundings and interact with them as part of the requested tasks.
Moving on to the video encoder, it is included so that the model can account for past states. Video encoders provide temporal awareness, something text or image encoders can't offer.
This is important because the next action to take will depend not only on the current state of the environment but also on the past states of the environment and actions taken.
For instance, the best action to take next might be lighting up a match, but itās probably not the best idea if the previous action was to cover the floor in gasoline.
Choosing the best policy
With this information, SIMA then uses a set of transformer models (just like ChatGPT does) that take the representations generated by the different encoders and, instead of predicting the next word as an LLM would, output the policy dictating the keyboard and mouse actions the agent executes.
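As a rough sketch of what such a policy head could look like (the action list, state dimension, and single linear layer are simplifying assumptions, not SIMA's actual design), the transformer stack's fused summary of the instruction and video history is mapped to logits over a discrete set of keyboard and mouse actions, and one action is sampled per timestep:

```python
# Sketch: a classification head that turns a fused state vector into one
# keyboard/mouse action per timestep.
import numpy as np

ACTIONS = ["key_w", "key_a", "key_s", "key_d",
           "mouse_left_click", "mouse_move_up", "mouse_move_down", "no_op"]
STATE_DIM = 64

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(STATE_DIM, len(ACTIONS)))   # stand-in for trained weights
b = np.zeros(len(ACTIONS))


def choose_action(state: np.ndarray) -> str:
    """state: the transformer's summary of the instruction + video history."""
    logits = state @ W + b
    probs = np.exp(logits - logits.max())                   # softmax over actions
    probs /= probs.sum()
    return ACTIONS[int(rng.choice(len(ACTIONS), p=probs))]  # sample one action


fused_state = rng.normal(size=STATE_DIM)   # would come from the Transformer-XL stack
print(choose_action(fused_state))
```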
On a side note, you may be wondering why they used such an odd set of Transformers as the model's main "brain" instead of simply using Gemini, Google's MLLM.
The reason was probably budget, as the researchers themselves acknowledge in the technical report that the obvious next step for SIMA is to use Gemini instead.
Despite not using the best "brain" available, they obtained the very interesting results we are about to see.
A True Generalizer
As you may imagine by now, the objective all along was to train an agent that wouldn't have to be the best at every game, but sufficiently good at any game it played.
After training, the SIMA agent was capable of performing up to 600 different basic tasks, grouped into categories such as navigation, animal interaction, and food.
You can check SIMA performing some of these actions here. Moreover, SIMA yielded very promising results worth mentioning.
For starters, despite being trained on many different games in parallel, on average SIMA performed better than agents specialized in playing one single game… in that specific game.
More impressively, across several different games the agent achieved non-trivial performance on zero-shot tasks in most of them, while again beating the specialized agents.
This meant that even when put in a previously unseen environment, the model performed well, and in some instances like in the case of Goat Simulator 3, it outperformed the specialized agent (the agent only trained on playing that game).
But what does this all mean?
Simply put, we are observing tangible evidence of knowledge transfer among games.
In other words, the model learns meaningful skills from some games that can be applied effectively to others (like learning to move with the keyboard).
What's more, these skills are of such high quality that this generalist agent beats the specialized ones in many games, signaling that the generalist approach helps it learn superior skills that can be applied across environments.
Very impressive results overall, which gain even more weight when put into perspective alongside Figure AI's developments.
A Great Week for Robotics
This has been a great week for robotics AI in general.
On one side, Figure AI proves we are getting better at building humanoids that can perform an increasing range of manual tasks.
On the other, SIMA shows that we are starting to see our first generalist agents across 3D environments.
But the real takeaway is the potential for synergy between the two.
We might not be there yet in terms of taking these agents into real-life situations, but the convergence of these two fields is the natural next step: SIMA as the training ground, Figure AI robots as the embodiment of the generalist agents.
And with other companies like Covariant launching their own takes on embodied intelligence (link below), it's clear that many players feel the technology is ripe enough to take on the next big challenge: deploying AI in the real world.
Best news of the week
➡️ The 100 most used GenAI apps according to a16z
➡️ Building a brain for robots
➡️ xAI to open-source Grok, according to Elon Musk
➡️ Covariant launches RFM-1, the LLM for robots
This week on Leaders…
We will discuss the emergence of Devin, the first-ever fully autonomous AI software engineer and probably the biggest news since the release of the original ChatGPT back in November 2022 (at least in terms of reactions).
However, we are going beyond the hype and the embarrassing "this is AGI" claims to focus on the real impact it's going to have and answer the pressing question: "So… what now?"
By understanding how they built such an amazing "thing", and reflecting on what such an event implies for the future of human labor, we will uncover the key points you can expect to have an immediate impact on your life.
You will also see high-signal insights from the likes of Apple, Microsoft, Deepseek, and many more.
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]