OpenAI and Google Present your Future Coworkers
🏝 TheTechOasis 🏝
Breaking down the most advanced AI systems in the world to prepare you for your future.
5-minute weekly reads.
AI Research of the Week: Google DeepMind presents a new robotics paradigm
Leaders: Unveiling the Game-changing Potential of GPT-4 Vision
🤯 AI Research of the week 🤯
One can safely say that Google DeepMind is to robotics what OpenAI is to Large Language Models.
AI-based robotics generates as much fear as hype, given the unsettling prospect of humanity embodying highly intelligent AI models in the physical world. Even so, this week we saw proof that the field is steadily approaching its ‘ChatGPT’ moment.
Whether we like it or not.
And their new models, the RT-X family, are a clear statement that AI has reached a point of no return.
A multiple-body motion predictor
Some months ago in this newsletter, we saw the case of RT-2, Google DeepMind’s state-of-the-art robotic arm.
RT-2, the universal arm
This model, a first of its kind, is a VLA (Vision-Language-Action) model that, given a video frame and an instruction, predicts the movements its actuator must perform to carry out the instruction, based on its camera observations.
Source: Google DeepMind
The model comprises a Vision Transformer and an LLM.
The former processes images and the latter text, with both encoding their respective inputs into a common embedding space from which the LLM can then decode the actions the arm must execute.
This was an industry first: the knowledge a model had learned from text and images was positively transferred into a robotic arm, making it far more capable of processing natural-language and vision instructions.
In summary, RT-2 is a model that, given a complex instruction in natural language, understands the task and predicts the actuator movements required to complete it.
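To make the pipeline concrete, here is a minimal, purely illustrative sketch of the idea behind a VLA model: image-patch features and text-token features are projected into a shared embedding space, fused, and decoded into discretized action tokens. All dimensions, weights, and the pooling step are toy assumptions for illustration, not DeepMind’s actual architecture.

```python
import random

random.seed(0)

EMBED_DIM = 8      # shared embedding space (toy size)
NUM_BINS = 16      # RT-2-style models discretize each action dimension into bins
ACTION_DIMS = 7    # e.g. 6-DoF arm pose + gripper (illustrative)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def project(vectors, W):
    """Project each feature vector into the shared embedding space."""
    return [[sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]
            for v in vectors]

# Frozen random "encoders" standing in for the ViT and LLM projections.
W_vision = rand_matrix(16, EMBED_DIM)                  # patch features -> shared space
W_text = rand_matrix(12, EMBED_DIM)                    # token features -> shared space
W_decode = rand_matrix(EMBED_DIM, ACTION_DIMS * NUM_BINS)

def vla_step(patch_feats, token_feats):
    """One camera frame + one instruction -> one discretized action per dimension."""
    # 1. Encode both modalities into the common embedding space.
    seq = project(patch_feats, W_vision) + project(token_feats, W_text)
    # 2. Fuse the combined sequence (here: a simple mean-pool).
    fused = [sum(tok[j] for tok in seq) / len(seq) for j in range(EMBED_DIM)]
    # 3. Decode logits over action bins and greedily pick one bin per action dim.
    logits = [sum(fused[i] * W_decode[i][k] for i in range(EMBED_DIM))
              for k in range(ACTION_DIMS * NUM_BINS)]
    return [max(range(NUM_BINS), key=lambda b: logits[d * NUM_BINS + b])
            for d in range(ACTION_DIMS)]

actions = vla_step(rand_matrix(9, 16), rand_matrix(5, 12))
print(len(actions))  # 7 discretized action tokens, one per action dimension
```

The key design point the sketch captures is the shared embedding space: because both modalities land in the same space, language knowledge can transfer to visually grounded action prediction.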
As you can see in the image above, RT-2 performed very semantically complex tasks such as ‘pick land animal’, correctly choosing between a horse figure and a fluffy octopus toy.
Much more than a robotic arm
For all these reasons, RT-2 was never meant to be simply a new robotic arm model, but a paradigm change.
It represented the effort of the industry in doing for robotics what has been achieved for natural language and vision… a general-purpose model.
And, in a way, RT-2 laid the first stone on the road in that direction.
Now, RT-X is a new iteration of this vision. And what an iteration that is.
A multibody approach
When announcing RT-X, Google also announced the release of the Open X-Embodiment repository, a collaborative effort across 21 universities and institutions worldwide to assemble the biggest robotics dataset known to date, spanning many different types of robots, or embodiments.
Naturally, RT-X is essentially RT-2 fine-tuned on a considerable portion of this dataset.
The objective was clear: a single, general-purpose model that works across many different robot embodiments.
A new emerging beast
When comparing the RT-X models (they also trained an RT-X version of their first model, RT-1-X) to standard robots, the results speak for themselves.
In almost every case, the RT-X models beat their counterparts.
This is particularly impressive considering that, apart from the RT-X models, all these models are task-specific: even though the RT-X models are general-purpose, they still beat the models in the very fields those models were specifically and uniquely trained for.
Also, we can see the dramatic improvement that RT-1-X shows against its previous version, RT-1, even though the architecture is identical.
This makes clear how important a model’s training data is, showcasing the Open X-Embodiment dataset’s value and confirming that knowledge was indeed transferred to the base model.
But the most impressive finding when evaluating the RT-X models was their emergent skills.
Understanding the nuances of words
RT-2-X can arguably be called the first robotics model that truly understands the nuances of language.
If we talk about the absolute movement of objects in a certain place, the model correctly identifies exact positions.
When referring to movements relative to another object, the model is ‘intelligent’ enough to understand what the different objects are, as well as complex requirements like moving something ‘in between’ two objects.
But even more impressively, RT-2-X understands subtle changes in the wording of a phrase, such as swapping ‘into’ for ‘near’: the required motion is completely different, yet the robot still interprets the instruction correctly.
The Dawn of Robots
RT-2-X is a step-function increase that obliterates the previous state of the art, its base model RT-2.
And when you put into perspective that RT-2 was published less than three months ago, you realize that the speed of development barely makes sense anymore.
Although this may now be apparent to you, it’s not apparent at all for laypeople, which is a scary predicament considering how dramatically our lives are going to change in the next few years.
And while researchers and enthusiasts alike toy with the idea of creating AGI, embodied intelligence is now approaching, meaning that these intelligent machines could be around us, see us, hear us, speak to us, and even touch us.
But is humanity prepared for that?
RT-X represents the first AI robotics model family that successfully incorporates knowledge from 22 different robots, assembling a multi-faceted, general-purpose robot model
RT-X dramatically increases emergent skills (skills unseen in the training data), achieving strong performance on out-of-distribution examples
🔮 Practical implications 🔮
This is the future of manufacturing, nothing more, nothing less
Soon enough, open-vocabulary robots could be used in consumer-end tasks like helping in the kitchen, cleaning, or ordering stuff
👾 Best news of the week 👾
😍 Google’s new AI photo-editing feature is simply amazing
🥇 Leaders 🥇
This week’s issue:
Unveiling the ‘godlike’ powers of GPT-4 Vision and picturing the future ahead of us
Last week, OpenAI announced that they were finally ready to release their long-awaited GPT-4 Vision, or GPT-4V.
And it’s safe to say that this multimodal version completely rewrites what a top AI model will look like from now on.
GPT-4V, besides instantiating what we should now define as Large Multimodal Models, or LMMs, will not only offer us new techniques like visual reference prompting, but it will also help AI understand our world better.
Now ChatGPT will not only have the capacity to read text; it will also have eyes (and a mouth, ears, and hands too). In other words, AI can now see our world “just like we do”, and a new set of emergent capabilities, unknown even to the creators of these models, is coming.
Bottom line, today we are doing five things:
Uncovering GPT-4V’s new communication interface
Understanding the impressive visual reference prompting feature
Exploring the new tasks GPT-4V excels at
Unraveling how users and companies will leverage GPT-4V in the short term
Predicting the future around LMMs
For all this, I believe we are in a critical time for humanity.
In the future, I believe we will look back on the release of GPT-4V and see it as the moment humans began their transition to a world where a decent part of your friends and coworkers won’t be human anymore.
Don’t believe it? Just wait and see the examples I’m about to show you.
Mirror mirror on the wall, who’s the fairest of them all?
The first big change that GPT-4V brings is the way we can communicate with it.
We can do so in three ways:
Text-only inputs
Single image with text inputs
Multiple images with text inputs
In layman’s terms, you can prompt the model by sending it a text, an image, or a combination of both.
The first requires no introduction, as it’s how you usually prompt GPT-4.
The second one allows you to send it an image and a text instruction (or a text that accompanies the image).
As you can see, the model is capable of successfully understanding complex patterns in images and suggesting the right answer.
This is impressive but only the beginning.
As mentioned earlier, you can also interleave multiple images and text as part of the prompt. Interleaving itself is not new, but previous multimodal models required a very strict structure for combining the two modalities.
With GPT-4V you can explain the thought process you need the model to perform as naturally as you would with a human.
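For developers, these three prompting modes map onto the chat message format OpenAI documented for its vision-capable models, where a message’s content can be a plain string or a list mixing text and image parts. The sketch below only builds the request bodies (no API call is made); the helper functions and URLs are illustrative assumptions, and the exact schema may change over time.

```python
# Build GPT-4V-style chat messages for the three prompting modes.
# Only the payloads are constructed here; no API key or network call is needed.

def text_only(question: str) -> dict:
    """Mode 1: plain text, just like prompting regular GPT-4."""
    return {"role": "user", "content": question}

def single_image(question: str, image_url: str) -> dict:
    """Mode 2: one image accompanied by a text instruction."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

def interleaved(parts: list) -> dict:
    """Mode 3: freely interleave text and image URLs, in the order
    you would naturally explain a thought process to a human."""
    content = []
    for p in parts:
        if p.startswith("http"):
            content.append({"type": "image_url", "image_url": {"url": p}})
        else:
            content.append({"type": "text", "text": p})
    return {"role": "user", "content": content}

# Hypothetical usage: two images woven into one reasoning chain.
msg = interleaved([
    "Here is the menu:", "https://example.com/menu.jpg",
    "and here is my receipt:", "https://example.com/receipt.jpg",
    "Was I overcharged?",
])
print(len(msg["content"]))  # 5 parts: text, image, text, image, text
```

The point of the interleaved mode is precisely that the structure is free-form: text and images alternate however the user’s explanation demands, rather than following a fixed template.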
But, if you look carefully, you might have noticed something “different”.
In the previous example, the user started by drawing a circle around a specific part of the image.
This is visual reference prompting, a new feature of GPT-4V (and therefore ChatGPT) that is leaving researchers and users alike speechless.
The New Communication Interface
With GPT-4V, you can direct the model’s attention to specific parts of the image.
This is a very important feature: part of what makes ChatGPT so revolutionary is that it democratizes AI for the masses, so it’s fundamental to provide capabilities that let non-experts communicate efficiently with it.
But GPT-4V is not only capable of accurately describing an image; it can go far beyond that and describe higher-level representations of the image, such as humor or scene interpretation.
Here, the model goes beyond an objective description of the image and understands the actual meme, correctly identifying the underlying concept of ‘procrastination’ that gives the meme its sense.
But if that impressed you, this next image will blow your mind.
Here, the model had to grasp a very complex relationship between exams, handwriting, and time: it recognized that the quality of the handwritten text decays over the course of an exam until it resembles little more than a flat line.
In my opinion, this joke would go over the heads of many people, and yet GPT-4V gets it.
Moreover, GPT-4V can also understand complex object placement and scenes.
Here, the model not only acknowledges what each object is, but it also takes into consideration the placement of those objects in the image.
It also links those objects to the figure of a ‘student’ and to complex associations of students with ‘untidiness’ and ‘remote work’, and even goes as far as inferring the climate where this person lives, well beyond what most people would suggest when seeing the image.
At this point, GPT-4V is already in a league of its own.
But recent explorations of the model prove that its features go well beyond that, with some advanced capabilities that have, quite literally, broken the Internet.