Circus Robots, the First AI VideoClip, & Open-Source Catching Up

🏝 TheTechOasis 🏝

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

💎 Big Announcement 💎

I know you’re in full AI overload and tired of low-impact daily newsletters with loads of hype and little substance to learn from.

But worry not: TheWhiteBox is a community for high-quality, highly curated AI content without unnecessary hype or ads, across research, models, markets, and AI products.

All in digestible language. But why would you want that?

  • Cut the Clutter: Value-oriented content, focusing on high-impact news and the clear intuitions to extract from it (why you should care).

  • Connect with Peers: Engage in valuable discussions with like-minded AI enthusiasts from top global companies.

  • Exclusive Insights: As the community grows, gain access to special content like expert interviews and guest posts with unique perspectives.

With TheWhiteBox, we guarantee you won’t need anything else.

No credit card information required to join the waitlist.

🚨 Week’s Update 🚨

Hello once again! We have a lot to unpack this week so let’s go.

A few minutes before I started writing this, Google DeepMind announced AlphaFold 3, the third generation of its model that predicts the structures of proteins.

Although I will provide much more detail in Sunday’s Leaders newsletter, proteins are complex molecules essential for biological functions, and understanding their shapes is crucial for insights into biology and drug development.

I’ve said it many times. AI’s most transformative contributions will come from its capability to discover, whether that is new laws of nature or new drugs.

Continuing with healthcare, we now have strong evidence that Alzheimer’s can be caused by carrying two copies of a single gene variant, APOE4. But you are probably wondering what that has to do with AI.

Well, some recent research is pointing at CRISPR, a gene editing system, as a way to treat Alzheimer’s by targeting these same gene copies, and with examples like EVO, the biological foundation model we discussed a while back, AI could be the tool that allows us to develop the appropriate proteins to perform the gene editing.

Be it as a medical assistant like Med-Gemini, a tool for drug discovery like Gnome, a way to better understand our genome like EVO, or even a means to cure diseases through CRISPR, you can’t be bullish enough on AI’s impact on healthcare.

But, sadly, as we have discussed many times, AI will also be used to win wars. And if you are waiting for proof, here you go:

Another week, another new Anduril product. This time, it’s an Electromagnetic Warfare (EW) system known as Pulsar, designed to identify potential threats or targets much faster.

And where’s the AI here? Well, its whole point is to deploy frontier AI at the tactical edge (in ‘edge’ devices like laptops, smartphones, or this ‘thing’).

In other words, the models are not accessed through the cloud; they run inside the actual hardware, thanks to Pulsar having on-board GPUs.

Deploying AI at the edge, whether by Apple or Anduril, is a very powerful concept. The reason for this is that AI is nothing but data compression. In other words, AI is all about learning from data and packaging that compressed knowledge in files we call ‘models’.

In layman’s terms, if you can fit ChatGPT in an EW product, that product has access to the complete knowledge that ChatGPT has without external connections.

A state of war where all deployed hardware, be that aircraft, tanks, or EW hardware, is ‘as intelligent as ChatGPT’ is approaching, fast.

But if you aren’t convinced about the impact of AI on warfare, an AI-controlled F-16 just took Air Force Secretary Frank Kendall for a ride.

Unbeknownst to many, the military has historically played a huge role in pushing technologies forward. The Internet, GPS, and drones are among the military-fueled inventions.

And it seems AI will be no different.

Going back to the real world, there are plenty of things to talk about too. Microsoft has allegedly trained a new model, named MAI-1, that will compete with OpenAI and Google’s flagship models.

Microsoft seems to be that partner who would give you serious trust issues if you were OpenAI.

Investments in rivals like Mistral or Inflection, the threat of poaching the entire staff back when OpenAI fired Sam Altman, and being very open about their preference for Small Language Models (SLMs), which are the direct opposite of OpenAI’s strategy, are examples suggesting that OpenAI really sold its soul to Microsoft back in 2019, and that Microsoft has complete control in the relationship.

Talking about the latter, heavy rumors surround OpenAI. For starters, it is rumored they are launching a search engine. And if their recent website facelift hints at anything, it’s that it’s Google but in black.

Luckily, you won’t have to wait that long to know if this is a nothing burger, as the announcement is expected today. And doing so four days before Google’s conference is a naughty naughty thing to do.

In the meantime, OpenAI is becoming extremely outspoken about regulation: a few days ago they released a statement asking for more secure AI infrastructure, and they have also spoken publicly about the idea of generating AI porn.

Let me translate what they are really saying.

“We are scaring everyone because we want regulatory capture.”

In other words, fears that companies at the model layer can’t ever be profitable unless open-source is crippled are real, and it seems these companies may resort to regulatory capture to make their businesses viable.

Finally, to leave the update on a musical vibe, in this week's ‘Who Should Better Start Adopting AI Tools Soon’ we have video editors: an entire music video clip generated with OpenAI’s Sora has been released (the music is human though, and quite good to be honest).

Personally, the more times I watch it, the creepier it gets.

🧐 You Should Pay Attention to 🧐

  • DrEureka, Nvidia’s Circus Robot

  • InternVL, Open-Source Finally Catching up to Closed Models

🎪 DrEureka, Nvidia’s Circus Robot 🎪

This week, Nvidia has once again reminded everyone how fast AI robotics is evolving… and also of LLMs’ great potential to minimize human intervention in the training process.

In layman’s terms, AIs training robot AIs.

For that, they have trained a robot that maintains equilibrium over a yoga ball in various scenarios, a powerful indication that robots are becoming “life-ready” really fast.

But how have they done this?

Defining the Reward

Have you ever wondered how AI robots are trained?

In short, we train a policy: a function that chooses the robot’s actions in a specific environment so as to maximize a given reward.

For example, if you want to train a model to learn to run the 100 meters, it first needs to learn to get up, walk, run, and eventually sprint, rewarding every ‘leap’ until the model figures out it has to run.

But how do we teach that? As an example, every time the model manages to stand upright, you give it points. On the other hand, if it falls, you penalize it.

This is what we know as reward modeling, creating a function that rewards good actions and penalizes bad ones. Over time, the model figures out the body positions and actions that maximize the cumulative reward.
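To make the idea concrete, here is a minimal sketch of such a reward function. Everything in it is an assumption for illustration: the names `step_reward` and `cumulative_reward`, the 1-metre "fully upright" height, the fall penalty, and the discount factor are made up, and this is a toy, not anything taken from Nvidia's work.

```python
def step_reward(torso_height: float, has_fallen: bool) -> float:
    """Toy reward for one time step: points for standing, a penalty for falling."""
    if has_fallen:
        return -10.0  # hypothetical penalty value
    # The reward grows with torso height and is clipped to the range [0, 1].
    return max(0.0, min(torso_height, 1.0))


def cumulative_reward(trajectory, discount: float = 0.99) -> float:
    """Discounted sum of step rewards over a list of (height, fallen) pairs."""
    total = 0.0
    for t, (height, fallen) in enumerate(trajectory):
        total += (discount ** t) * step_reward(height, fallen)
    return total


# Two made-up three-step episodes: one stays upright, the other falls.
upright = [(1.0, False), (1.0, False), (1.0, False)]
stumble = [(1.0, False), (0.4, False), (0.0, True)]

print(cumulative_reward(upright), cumulative_reward(stumble))
```

The upright episode earns a higher cumulative reward than the stumbling one, which is exactly the signal the training algorithm follows when it updates the policy.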

If you’re curious about how this whole process evolves over time visually, I highly recommend this YouTube channel.

This is pretty much standard, with other frontier labs like Google DeepMind using these same principles to train soccer-playing robots that get up, sprint, shoot, and even defend.

However, defining these reward functions is extremely hard.

For instance, if we think of the human body, what actions across every single muscle and joint in your hands are positive or negative to the overall outcome of, let’s say, pen-spinning?

In situations where the action is very complex, humans struggle to write down the best possible reward function that, eventually, will help the robot act accordingly.

To give a sense of this complexity, this is the reward function they eventually settled on for the ball-walking task:

And I can guarantee finding this equation is as hard as it looks. But here’s the thing, that function wasn’t written by humans.

AIs Rewarding AIs

A few months ago, the same researchers behind DrEureka released Eureka, an AI-based reward function design algorithm.

The idea was to use a Large Language Model (GPT-4) in an iterative loop: given an environment (the code describing the scene where the robot will act) and the task it has to perform, the LLM generated reward functions that were then evaluated in the simulated environment.

Based on the outcomes in the simulation, feedback was generated, which was then used by the LLM to generate a new reward function, until a quality threshold was met.
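The loop described above can be sketched in a few lines. This is only an outline under heavy assumptions: `propose` stands in for the LLM call and `evaluate` for the simulator, both are hypothetical stubs, and a "reward function" is reduced to a single number so the example runs on its own. It is not the Eureka code.

```python
def eureka_style_loop(propose, evaluate, threshold, max_iters=10):
    """Propose a candidate, score it, feed the result back, and repeat."""
    feedback = "no feedback yet"
    best_candidate, best_score = None, float("-inf")
    for _ in range(max_iters):
        candidate = propose(feedback)   # stand-in for the LLM writing a reward function
        score = evaluate(candidate)     # stand-in for training and scoring in simulation
        if score > best_score:
            best_candidate, best_score = candidate, score
        if best_score >= threshold:     # quality threshold met: stop iterating
            break
        feedback = f"candidate={candidate!r} scored {score:.2f}"
    return best_candidate, best_score


def make_stub_proposer(candidates):
    """Hypothetical 'LLM': replays a fixed list of candidates and ignores feedback."""
    remaining = iter(candidates)

    def propose(feedback):
        return next(remaining)

    return propose


def stub_evaluate(weight):
    """Hypothetical 'simulator': the score peaks when the candidate equals 0.7."""
    return 1.0 - abs(weight - 0.7)


best, score = eureka_style_loop(
    make_stub_proposer([0.1, 0.5, 0.7, 0.9]), stub_evaluate, threshold=0.95
)
print(best, score)
```

In the real system the feedback string is where the interesting work happens, since it is what lets the LLM improve on its previous attempt rather than guess blindly.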

Through Eureka, the team achieved quite an astounding result, as they managed to train a robot to perform a pen-spinning trick, something that even the best CGI experts would struggle with.

And with almost zero human intervention.

Now, picture yourself writing down the reward function that would model the score for each finger, joint, muscle, and phalanx to achieve such movement. Undeniably, with Eureka, Nvidia proved to the world that AIs training AIs is the future of robotics.

But there was a problem. What if we wanted to take these results into the real world?

AIs in an Open World

With robotics, the hardest issues to deal with are costs and uncertainty.

Training a robot in real life requires, well, a robot, which will often be very expensive. It also means dealing with real constraints (friction, wind, temperature, etc.) and, of course, things breaking.

Consequently, researchers usually aim to train the model in simulation and hope for the best. In layman’s terms, they don’t test robots in real life; they expect them to just work.

To achieve this, the process, known as sim-to-real, requires an additional step: domain randomization.

But what is that?

If you think about the real world, it’s full of uncertainty. Temperature changes every second, the robot can run into bumpy terrain, joint movement can become less smooth with use, and objects can get in the way, among other constraints.

Thus, to maximize the chances of success, researchers introduce randomized variations to a set of constraints (like stiffness or friction) to train the model in simulation across different environment scenarios and increase its robustness.
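As a rough illustration, domain randomization can be as simple as drawing new physical parameters for every training episode. The parameter names and ranges below are invented for the example, and uniform sampling is an assumption on my part, not something the paper is quoted on here.

```python
import random

# Hypothetical ranges for each randomized constraint: (low, high).
RANDOMIZATION_RANGES = {
    "friction": (0.3, 1.2),
    "joint_stiffness": (20.0, 60.0),
    "payload_kg": (0.0, 2.0),
}


def sample_environment(ranges, rng):
    """Draw one simulated scenario by sampling each constraint uniformly."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # seeded so the scenarios are reproducible
episodes = [sample_environment(RANDOMIZATION_RANGES, rng) for _ in range(5)]

for scenario in episodes:
    print(scenario)
```

Each training episode then runs with a slightly different version of the world, so the policy cannot overfit to one exact friction or stiffness value.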

However, figuring out what values to define for each constraint is extremely hard.

So the researchers asked: could LLMs help us here too?

But here’s the thing: the domain randomization search space, that is, the set of values each environment constraint can take, is infinite, making the problem very hard for LLMs.

To mitigate this, they first performed a light search, known as RAPP (Reward-Aware Physics Prior), which essentially narrowed down each constraint’s range to values that were physically possible (e.g., negative friction values aren’t possible).

This narrows the search space for the LLM to generate the different domain conditions.
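A minimal sketch of that narrowing step, in the spirit of RAPP but not the paper's actual procedure: try a handful of candidate values for one constraint and keep the smallest and largest ones that pass a feasibility check. The function `narrow_range`, the candidate list, and the `friction_is_feasible` rule are all hypothetical; in practice the check would involve running the robot in simulation.

```python
def narrow_range(candidates, is_feasible):
    """Return (min, max) over the candidate values that pass the feasibility check."""
    feasible = [value for value in candidates if is_feasible(value)]
    if not feasible:
        raise ValueError("no feasible value found among the candidates")
    return min(feasible), max(feasible)


# Candidate friction values spanning several orders of magnitude (made up).
friction_candidates = [-1.0, -0.1, 0.01, 0.1, 1.0, 10.0, 100.0]


def friction_is_feasible(value):
    """Toy stand-in for a simulation check: positive and not absurdly large."""
    return 0.0 < value <= 10.0


friction_range = narrow_range(friction_candidates, friction_is_feasible)
print(friction_range)
```

The resulting bounds are what the LLM would be given to work within when it proposes randomization ranges.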

This step is particularly important, as Reinforcement Learning robots tend to ‘hack’ the reward, even generating unrealistic actions. This short video will give you an idea.

Once RAPP boundaries were set, they again used GPT-4 to generate plausible values for each constraint to create new scenarios (domain randomization), giving us the entire DrEureka framework:

Overall, DrEureka works as follows:

  1. A task and safety instructions are defined, as well as the environment’s conditions.

  2. RAPP is performed to constrain the value ranges for each condition.

  3. The LLM then generates possible-yet-realistic scenarios for each environment (domain randomization).

  4. In parallel, another LLM generates the reward function based on the conditions for that specific trial.

  5. A policy is trained to maximize that reward, generating feedback to iterate over steps three and four until a quality threshold is met.

  6. Finally, the robot is deployed in reality. And the results speak for themselves, as shown in the videos.

What We Think

To say Nvidia is betting hard on robotics is an understatement. In fact, Jim Fan, one of the researchers on this paper, is also leading Project GR00T, a landmark Nvidia project to develop a foundation model for real-world agents.

Their results offer an exciting glimpse of the speed of development in the AI robotics space, which seems to be accelerating, judging not only by DrEureka but also by other companies like Figure AI, 1X, or Field AI.

And importantly, all seem to be using Multimodal LLMs as the brains of the system.

🔥 InternVL-1.5, Open-Source Catching Up 🔥

While open-source text-only Large Language Models (LLMs) are almost at closed-model level, with LLaMa 3 405B (still to be released) potentially as good as or better than GPT-4, Multimodal LLMs (MLLMs) are lagging heavily.

Or are they?

A group of Chinese researchers has released InternVL 1.5, an MLLM that matches the performance of GPT-4V or Claude 3 Opus in many tasks and, in some cases, even sets a new state of the art, especially in OCR (recognizing text in images) and document and chart question-answering.

Fascinatingly, this model is fully open-source, as both the weights and the training datasets have been released, something not even LLaMa models can claim.

But what did this team do to achieve such amazing results?

Why Open-Source Models Suck

Although I recommend reading my blog on what MLLM models are, here’s the gist. Using an LLM as the backbone, we connect other components, namely encoders, that allow the LLM to process other types of data besides text.

For instance, an image encoder will allow the LLM to process images too (which is why ChatGPT can now interpret images).

However, in the open-source world, money is tight.

Therefore, researchers cut every possible corner to achieve ‘publishable’ results. Of course, this strategy implies that results, while good, are well behind those of cash-rich proprietary labs.

For example, while proprietary labs have all components trained in a combined fashion, open-source teams take open-source components and stitch them together with the use of an ‘adapter’, which usually is an MLP layer.

Then, while keeping both the encoder and the LLM frozen (they are pre-trained but aren’t updated during the MLLM training), the team just trains the adapter.

But what does the adapter do?

In layman’s terms, if we have an image encoder, the encoder processes the image and then the adapter transforms that information into LLM-readable tokens.

Think of an adapter as taking in an image and transforming it into an image caption. That way, the text-only LLM can ‘see’ the image, even if it’s just “seeing the caption”.
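Strictly speaking, the adapter outputs vectors in the LLM's embedding space rather than literal caption text. Here is a minimal pure-Python sketch of that projection, assuming a two-layer MLP with a ReLU in between. The dimensions are tiny and made up, the weights are random instead of trained, and the "encoder features" are random numbers standing in for a frozen image encoder's output.

```python
import random


def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_adapter(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialize the only part that would be trained: the adapter."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(out_dim, hidden_dim), "b2": [0.0] * out_dim,
    }


def mlp_adapter(feature, params):
    """Project one encoder feature vector into the LLM's embedding size."""
    hidden = [max(0.0, h) for h in linear(feature, params["w1"], params["b1"])]  # ReLU
    return linear(hidden, params["w2"], params["b2"])


VISION_DIM, HIDDEN_DIM, LLM_DIM = 4, 8, 6  # illustrative sizes only
rng = random.Random(0)
adapter = init_adapter(VISION_DIM, HIDDEN_DIM, LLM_DIM, rng)

# Stand-in for the frozen encoder's output: one feature vector per image patch.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
visual_tokens = [mlp_adapter(feature, adapter) for feature in patch_features]

print(len(visual_tokens), len(visual_tokens[0]))
```

Each image patch ends up as one vector of the same size as the LLM's text token embeddings, which is what lets the frozen LLM consume it as if it were just another token.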

Additionally, to cut costs even more, open-source models usually only accept a single image resolution. Therefore, if the provided image isn’t of that exact size, they resize it.

Imagine you feed it a high-definition image and the model treats it like a 300-by-300-pixel image: information will be lost.

So, how did the team handle these constraints?

Huge Encoder, Huge Results

InternVL 1.5’s innovations can be summarized as follows:

  1. Dynamic input resolution

  2. Entire model training

  3. High-quality dataset

For starters, they designed the model to accept any image size by defining a set of preset aspect ratios (1:3, 2:3, etc.).

Thus, every image gets matched to a specific size and resized that way. Images are still resized, but without compromising too much information.
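The matching step might look something like the sketch below. The 448-pixel tile size and the cap of 12 tiles are my assumptions about the setup, and the tie-breaking rule (prefer the grid whose pixel area is closest to the original image) is a simplification I chose, not necessarily what the InternVL team implemented.

```python
TILE = 448  # assumed tile side length in pixels


def candidate_grids(max_tiles):
    """All (columns, rows) grids that use at most max_tiles tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def closest_grid(width, height, max_tiles=12):
    """Pick the preset grid whose aspect ratio best matches the image."""
    target_ratio = width / height
    image_area = width * height

    def mismatch(grid):
        cols, rows = grid
        ratio_gap = abs(target_ratio - cols / rows)
        area_gap = abs(cols * rows * TILE * TILE - image_area)
        return (ratio_gap, area_gap)  # ratio first, area only breaks ties

    return min(candidate_grids(max_tiles), key=mismatch)


def target_size(grid):
    """Pixel size the image is resized to before being cut into tiles."""
    cols, rows = grid
    return cols * TILE, rows * TILE


grid = closest_grid(800, 1200)  # a portrait image with a 2:3 aspect ratio
print(grid, target_size(grid))
```

A portrait image keeps its portrait shape after resizing instead of being squashed into a square, which is where most of the information loss used to come from.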

Also, unlike in common training methods, the entire model is trained during the MLLM phase, not only the MLP projector (adapter).

But to keep the entire training in budget, they trained the model progressively with different curated datasets and different LLM models, so that the overall costs were smaller, in a similar fashion to Li et al.

Finally, they curated a bilingual (English/Chinese) dataset that enhanced the model’s language capabilities while increasing its performance on text-image data.

Based on all this, how does the overall process work?

  1. An image and a text task (like ‘What animal is that?’) are provided

  2. The model then chunks the image into equal-sized patches (the number depending on the total size)

  3. The image encoder processes each patch, and the adapter then transforms the resulting image tokens into LLM-readable tokens, which are processed by the LLM together with the text to generate the response.
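The chunking in step two can be sketched as computing one bounding box per patch. This assumes the image has already been resized to a whole number of square tiles; the function name `tile_boxes` and the 448-pixel tile size are illustrative assumptions.

```python
def tile_boxes(cols, rows, tile=448):
    """Return (left, top, right, bottom) boxes, row by row, for a cols x rows grid."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


# An image resized to 896 x 1344 pixels splits into a 2 x 3 grid of tiles.
boxes = tile_boxes(cols=2, rows=3)
print(len(boxes), boxes[0], boxes[-1])
```

Each box would then be cropped out and passed through the image encoder independently, so a larger image simply produces more tiles rather than a blurrier one.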

And that’s it, the first state-of-the-art fully open-source multimodal model.

What We Think

As Yann LeCun, Chief AI Scientist at Meta, has always predicted, open-source is catching up, and InternVL is great proof of that.

As mentioned, InternVL 1.5 obtains unmatched performance in tasks like OCR, even when compared against the big three (ChatGPT, Claude, Gemini), and performs comparably in most other benchmarks, signaling that the gap between closed and open-source models is finally closing.

That said, some forces could soon change this.

Concerns around the US’ approach to regulation are growing, as frontier labs are on a crusade to have AI treated as being as dangerous as nuclear weapons, even arguing for the equivalent of a Federal AI Administration, similar to the FDA for food and drugs, something heavily hinted at in Biden’s AI Executive Order.

Such a regulation would impose prohibitive constraints on AI builders/engineers, and even criminal liability on developers in case their models incited unlawful actions.

🧐 Closing Thoughts 🧐

This week has been a bit of a mixed bag.

We have seen huge developments in healthcare, robotics, and open-source MLLMs, but the increasing role of AI in far more dangerous territories like warfare, and the looming threat of watertight regulation that suffocates all but the incumbents, are also becoming very apparent.

For those reasons, staying up-to-date on the industry's state is critical, and I bet you agree with me.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]