RT-2, The World's Most Intelligent Robot
One of the good things about ChatGPT becoming superintelligent is that, at least, it doesn't have a physical form.
That way, it can't come after humanity if it goes rogue, right?
But now, a group of researchers at Google DeepMind has combined the power of visual-language models (VLMs) with robotics to create an arm-equipped robot that understands language and executes actions based on it.
Dubbed RT-2, it has set a new state-of-the-art for robotics in terms of generalization and emergent capabilities, a step-function increase over everything we had seen until now.
And to pull this off, they've created a new class of AI models.
A new model class
VLAs are visual-language-action models: models able to output actions that a robot can then execute to perform the desired movement.
Of these, RT-2 is the first of its kind, capable of ingesting images and text, and outputting not only text but robotic actions too.
In other words, RT-2 is multimodal, the new paradigm for Generative AI that aims to combine multiple modalities (text, images, video, etc.) into a single model, a necessary step toward AGI and, ultimately, superintelligence.
Of course, due to the impressive nature of current language models like GPT, researchers decided there was great potential in using these web-scale models for robotic applications.
The reason was simple: building a robotic model from scratch using only robotic data was impossible given how scarce such data is.
However, when researchers tried to use web-scale LLMs or VLMs for robotic applications, the results were, to put it mildly, underwhelming.
Specifically, these models' capacity to generalize and perform well in unseen environments or with unseen objects was poor.
But then researchers at Google came to a realization.
What if we take an {image + text => text} VLM and co-fine-tune it on both visual-language tasks (like answering questions about images) and robotic data, teaching the model to output robot actions in addition to text?
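To make that idea concrete, here is a minimal sketch of the data-mixing step, with hypothetical helper names and a simplified action encoding (my illustration, not DeepMind's training code):

```python
import random

def vqa_example(image, question, answer):
    # Ordinary visual-language sample: the target is natural language.
    return {"image": image, "prompt": question, "target": answer}

def robot_example(image, instruction, action_tokens):
    # Robot sample: the target is a string of discretized action tokens,
    # e.g. "1 128 91 241 5 101 127 217" for an 8-dimensional action.
    return {"image": image, "prompt": instruction,
            "target": " ".join(str(t) for t in action_tokens)}

def cofinetune_batch(vqa_data, robot_data, robot_ratio=0.5, batch_size=32):
    # Mix both sources in every batch so the model keeps its web-scale
    # semantic knowledge while learning to emit action strings.
    batch = []
    for _ in range(batch_size):
        source = robot_data if random.random() < robot_ratio else vqa_data
        batch.append(random.choice(source))
    return batch
```

Because the robot actions are serialized as plain text, the VLM's training recipe barely changes; the model simply learns a new "dialect" of output tokens.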
And that, my friends, is what we now call VLAs.
But what are they really?
An action predictor
Visual-language models do the same thing LLMs like ChatGPT do: they learn to predict the most likely next token based on the provided input.
For LLMs, you give them a text sequence, and they continue it with more text.
For VLMs, you give them an image and text and they give you text back too.
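As a toy sketch of that shared recipe (the ToyModel class and the token values below are invented for illustration, not anyone's real code), the loop is the same in every case: predict a token, append it, and repeat until a stop token appears.

```python
from typing import List, Optional

STOP = -1  # hypothetical end-of-sequence token id

class ToyModel:
    """Stand-in for a real LLM/VLM; returns canned tokens for illustration."""
    def __init__(self, canned: List[int]):
        self.canned = list(canned)

    def predict_next(self, image: Optional[bytes], tokens: List[int]) -> int:
        return self.canned.pop(0) if self.canned else STOP

def generate(model, image, prompt_tokens, max_len=64):
    # Autoregressive decoding: keep asking for the most likely next token.
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        nxt = model.predict_next(image, tokens)  # image is None for a pure LLM
        if nxt == STOP:
            break
        tokens.append(nxt)
    return tokens

# An LLM call passes no image; a VLM call passes one alongside the text tokens.
print(generate(ToyModel([5, 9, 2]), image=None, prompt_tokens=[1, 2, 3]))
```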
Now, DeepMind has trained RT-2 to output not only text, but also robotic actions that a robot then takes as input, as you can see below:
In layman's terms, it's ChatGPT equipped with a camera: the camera provides the observations, and the model predicts the action tokens the robot has to perform based on them.
In case you're wondering, we're talking about an 8-DoF action space, with 6 degrees of freedom for movement, 1 for gripper extension, and another for action termination.
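To give a rough sense of how predicted action tokens become a robot command, here is a sketch under stated assumptions: I assume each of the 8 dimensions is discretized into 256 bins, and the per-dimension ranges below are invented for illustration rather than the real robot's limits.

```python
import numpy as np

# Hypothetical per-dimension ranges for the 8-DoF action:
# delta x/y/z, delta roll/pitch/yaw, gripper extension, termination flag.
ACTION_LOW  = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0, 0.0])
ACTION_HIGH = np.array([ 0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0, 1.0])
NUM_BINS = 256  # assumed number of discretization bins per dimension

def detokenize_action(token_ids):
    """Map 8 predicted integer tokens (0..255) back to a continuous action."""
    bins = np.asarray(token_ids, dtype=np.float32)
    fraction = bins / (NUM_BINS - 1)              # normalize to [0, 1]
    action = ACTION_LOW + fraction * (ACTION_HIGH - ACTION_LOW)
    terminate = action[-1] > 0.5                  # last dimension ends the episode
    return action[:7], terminate

# Example: a sequence of action tokens produced by the model.
predicted = [128, 130, 120, 127, 127, 127, 255, 0]
movement, done = detokenize_action(predicted)
```

The key point is that, to the model, these eight numbers are just more tokens; only the robot's controller treats them as a physical command.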
But did they succeed?
Generalization and emergence
With VLAs, humans have achieved a new breakthrough in robotics: the first robot that combines the superior, Internet-scale semantic knowledge of a pre-trained VLM with robotic actions, yielding a robot that understands assignments and performs the desired actions.
What's more, it does this while completely obliterating previous state-of-the-art baseline models in generalization and emergence, as you can see in the following graph (RT-2 models depicted in green and blue):
In layman's terms, RT-2 has proven very successful at performing well in situations it never saw during training (generalization), while also developing new, unexpected capabilities thanks to the scale of linguistic knowledge it inherits from the VLM (emergence), both unparalleled in the field of robotics.
And the most incredible thing: it's the first robot capable of reasoning.
To prove this point, among the more than 6,000 trials Google ran, some included the following impressive actions:
If we take the top-second-left example, the robot is capable of understanding the precarious position of the second bag of chips, allowing it to respond to the request perfectly.
To perform this action, the model required dense semantic knowledge of the images captured by the camera, understanding that the bag at the edge of the table might fall.
Similarly, in the bottom-right example, the model not only understands the difference between a donkey and an octopus, but it's also capable of reasoning that "land animal" refers to the donkey, not to the octopus.
This is completely unseen in today's AI world.
Additionally, Google evaluated the model quantitatively across three benchmarks:
- Symbol understanding
- Reasoning
- Human recognition
Amazingly, RT-2 proved capable of performing actions like the ones exemplified below:
As you can see, the model demonstrates impressive emerging semantic capabilities.
In other words, even though the addition of a visual-language model didn't enable new robotic motions, it did transfer abundant semantic knowledge to the robot, making it much more aware of intricate concepts like placement, object recognition, and logical reasoning.
The wheel of innovation keeps turning
After seeing RT-2, it wouldn't be surprising if, by the end of next year, factories around the world were using robots like these to enhance their manufacturing processes.
In fact, that is probably just the tip of the iceberg in terms of how robotics is going to evolve over the coming months… and I don't know how to feel about it.
Key AI concepts to retain from today's newsletter:
- Visual-Language-Action Models
- The new State-of-the-art for AI Robotics
Top AI news for the week
Note: News is updated up to the 1st of August, as I've been out on holiday from Wednesday until today.
- Article proves LLaMa could be more expensive than ChatGPT-3.5
- McKinsey's take on the effect GenAI will have on the US labor market