RT-2, The World's Most Intelligent Robot

🏝 TheTechOasis 🏝

One of the few comforts about the idea of ChatGPT becoming superintelligent is that, at least, it has no physical form.

That way, at least it can’t come after humanity if it goes rogue, right?

But now, a group of researchers at Google DeepMind has combined the power of vision-language models (VLMs) with robotics to create a robot arm that understands language and executes actions based on it.

Dubbed RT-2, it has set a new state-of-the-art for robotics in terms of generalization and emergent capabilities, a step-function increase over everything we had seen until now.

And to pull this off, they’ve created a new class of AI models.

A new model class

VLAs are vision-language-action models, capable of outputting actions that a robot can then execute to perform the desired movement.

Of these, RT-2 is the first of its kind: it ingests images and text, and outputs not only text but robotic actions too.

In other words, RT-2 is multimodal, the new paradigm for Generative AI that aims to combine multiple modalities (text, images, video, etc.) into a single model, a necessary step toward AGI and, ultimately, superintelligence.

Of course, given the impressive capabilities of current language models like GPT, researchers saw great potential in using these web-scale models for robotics.

The reason was simple: building a robotics model from scratch using only robotic data was impossible, given the scarcity of such data.

However, when researchers tried to use web-scale LLMs or VLMs for robotic applications, the results were, to put it simply, underwhelming.

Specifically, the capacity of these models to generalize and perform well in unseen environments or with unseen objects was really bad.

But then researchers at Google came to a realization.

What if we take an {image + text => text} VLM and co-fine-tune it on both vision-language tasks (like answering questions about images) and robotic data, teaching the model to output robot actions in addition to text?
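To make the idea concrete, here is a minimal sketch of what that co-fine-tuning data mix could look like. The helper names (`encode_action_as_text`, `co_finetune_batch`) and the mixing ratio are my own illustrative assumptions, not DeepMind’s actual training code.

```python
import random

def encode_action_as_text(action_bins):
    # Serialize a discretized robot action (e.g. 8 integer bins) into a plain
    # string, so the VLM can treat it as ordinary output text.
    return " ".join(str(b) for b in action_bins)

def to_training_example(sample):
    # Both data sources are reduced to the same (image, prompt, target text) format.
    if sample["type"] == "vqa":
        # Web-scale vision-language task: answer a question about an image.
        return sample["image"], sample["question"], sample["answer"]
    # Robot episode step: map a camera frame + instruction to action tokens.
    return sample["image"], sample["instruction"], encode_action_as_text(sample["action_bins"])

def co_finetune_batch(vqa_data, robot_data, robot_fraction=0.5, batch_size=32):
    # Mix web data and robot data in every batch, so the model keeps its
    # semantic knowledge while learning to emit actions.
    batch = []
    for _ in range(batch_size):
        pool = robot_data if random.random() < robot_fraction else vqa_data
        batch.append(to_training_example(random.choice(pool)))
    return batch
```

The key trick is that robot actions are written out as text, so the exact same next-token objective covers both kinds of data.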

And that, my friends, is what we can now call a VLA.

But what are they really?

An action predictor

Vision-language models do the same thing as LLMs like ChatGPT: they learn to predict the most likely next token based on the provided input.

For LLMs, you give them a text sequence, and they complete it with more text.

For VLMs, you give them an image and text and they give you text back too.

Now, DeepMind has trained RT-2 to output not only text, but also robotic actions that are then fed to the robot as input, as you can see below:

In layman’s terms, it’s ChatGPT equipped with a camera: the camera provides the observations, and the model predicts the action tokens the robot has to perform based on them.

In case you’re wondering, we’re talking about an 8-DoF action space: 6 degrees of freedom for end-effector movement (position and rotation), 1 for gripper extension, and another for action termination.
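As a rough illustration of how such an action could be turned into tokens and back, here is a small sketch. The RT-1/RT-2 line of work discretizes each action dimension into 256 bins, but the value ranges and helper names below are my own assumptions, not the actual implementation.

```python
import numpy as np

NUM_BINS = 256
# Assumed ordering and ranges: [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]
LOW  = np.array([0.0, -1.0, -1.0, -1.0, -np.pi, -np.pi, -np.pi, 0.0])
HIGH = np.array([1.0,  1.0,  1.0,  1.0,  np.pi,  np.pi,  np.pi, 1.0])

def action_to_tokens(action):
    """Discretize a continuous 8-dim action into integer bins, then into a string."""
    action = np.clip(action, LOW, HIGH)
    bins = np.round((action - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return " ".join(map(str, bins))  # e.g. "0 132 114 128 25 156 114 255"

def tokens_to_action(token_string):
    """Invert the mapping so a robot controller can execute the command."""
    bins = np.array([int(t) for t in token_string.split()])
    return LOW + bins / (NUM_BINS - 1) * (HIGH - LOW)
```

The robot-side controller inverts the mapping to recover a continuous command, which is why the model can “speak” actions with the same vocabulary machinery it uses for words.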

But did they succeed?

Generalization and emergence

With VLAs, humans have achieved a new breakthrough in robotics: the first robot that combines the superior, Internet-scale semantic knowledge of a pre-trained VLM with robotic actions, creating a robot that understands assignments and performs the desired actions.

What’s more, it does this while completely obliterating previous state-of-the-art baselines in generalization and emergence, as you can see in the following graph (RT-2 models depicted in green and blue):

In layman’s terms, RT-2 has proven very successful at performing well in situations it never saw during training (generalization), while also developing new, unexpected capabilities thanks to the scale of the linguistic knowledge it inherits from the VLM (emergence), both unparalleled in the field of robotics.

And the most incredible thing: it’s the first robot capable of reasoning.

To prove this point, among the more than 6,000 trials Google ran, some included the following impressive actions:

If we take the second example from the left in the top row, the robot is capable of understanding the precarious position of the second bag of chips, allowing it to respond to the request perfectly.

To perform this action, the model required dense semantic knowledge of the images captured by the camera, understanding that the bag at the edge of the table might fall.

Similarly, in the bottom right example, the model not only understands the difference between a donkey and an octopus, but it’s also capable of reasoning that “land animal” refers to the donkey, not to the octopus.

This is completely unseen in today’s AI world.

Additionally, Google evaluated the model quantitatively across three benchmarks:

  • Symbol understanding

  • Reasoning

  • Human recognition

Amazingly, RT-2 proved capable of performing actions like the ones exemplified below:

As you can see, the model demonstrates impressive emergent semantic capabilities.

In other words, even though adding a vision-language model didn’t enable new robotic motions, it did transfer abundant semantic knowledge to the robot, making it much more aware of intricate concepts like placement, object recognition, and logical reasoning.

The wheel of innovation keeps turning

After seeing RT-2, it wouldn’t be surprising if, by the end of next year, factories around the world were using robots like these to enhance their manufacturing processes.

In fact, that is probably just the tip of the iceberg in terms of how robotics is going to evolve over the upcoming months… and I don’t know how to feel about it.

Key AI concepts to retain from today’s newsletter:

- Vision-Language-Action (VLA) Models

- The new State-of-the-art for AI Robotics

👾Top AI news for the week👾

Note: News items are up to date as of the 1st of August, as I’ve been on holiday from Wednesday until today.

💸 Article argues LLaMA could be more expensive than ChatGPT-3.5

🧐 McKinsey’s take on the effect GenAI will have on the US labor market