
A Self-Correcting Humanoid, Grokking & a Tool that Works

In partnership with

THEWHITEBOX
TLDR;

  • 📰 News from FigureAI, OpenAI, and NapkinAI

  • 🧐 Insights on semiconductors, AI economics, and a rundown on what matters in AI today

  • 😍 Trend of the Week: Grokking

The fastest way to build AI apps

  • Writer Framework: build Python apps with drag-and-drop UI

  • API and SDKs to integrate into your codebase

  • Intuitive no-code tools for business users

NEWSREEL
Figure 02, The Self-Correcting Robot

In what is without a doubt the biggest news of the week, Figure.ai has announced the newest version of its humanoid, with a cleaner look and improved intelligence, to the point that the robot is allegedly self-correcting, meaning it can catch and fix its own mistakes mid-task.

The announcement video includes a rundown of the new features and shows the humanoid performing tasks in a BMW factory, where it is being tested for eventual deployment as an autonomous factory worker.

TheWhiteBox’s take:

There are many things to explain here.

For starters, these robots run on OpenAI intelligence: a multimodal vision-language model (probably a tailored variation of GPT-4o) takes the feed from the robot’s six cameras plus its proprioception (the position and angle of its 16 joints at any given time) and decides the next move. For more details, read my article on HumanPlus by Stanford.

Importantly, it performs speech-to-speech reasoning, meaning it can take instructions by voice and execute the actions while explaining out loud what it’s doing.

Undeniably, the self-correction mechanism is the most impressive part. Although they don’t provide details on how it works, this behavior is almost certainly possible only in very specific situations (like the one shown in the video).

The model has most likely memorized certain target poses (a specific combination of what the cameras see and the position and angle of the joints, just like HumanPlus did) and essentially moves itself toward that pose, correcting when it drifts.

On a final note, I’m very impressed with how little energy the humanoid needs: its battery holds around just 2.25 kWh. For reference, a standard hairdryer draws around 1.2-1.8 kW, so a full charge stores roughly the energy of running a hairdryer for an hour and a half; from an energy perspective, this humanoid is ‘household ready.’

OpenAI’s Structured Outputs

OpenAI has announced Structured Outputs, a new API feature that lets you constrain the model’s JSON output to a JSON Schema you provide, a critical step forward in building reliable agents.

In other words, whenever you need the model’s response in a specific format (a critical requirement for connecting LLMs to third-party APIs so that the model can perform actions on other systems), you can specify the exact structure and the model will comply with uncanny accuracy.
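To make this tangible, here’s a minimal sketch of what it looks like with OpenAI’s Python SDK. The model identifier and the schema are illustrative assumptions on my part, so check the official docs for the models that currently support strict schemas:

# Minimal sketch of Structured Outputs via the chat completions API.
# The model name and schema below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "event": {"type": "string"},
        "day": {"type": "string"},
        "time": {"type": "string"},
        "location": {"type": "string"},
    },
    "required": ["event", "day", "time", "location"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # assumed identifier for a schema-capable model
    messages=[{
        "role": "user",
        "content": "Extract the event details: 'Dinner with Sarah next Friday at 8pm at Luigi's.'",
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "calendar_event", "strict": True, "schema": schema},
    },
)

print(response.choices[0].message.content)  # JSON constrained to the schema above

Because the output is guaranteed to parse against the schema, you can feed it straight into another system (a calendar API, a database, another agent) without the usual defensive parsing code.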

TheWhiteBox’s take:

It’s been almost a year and a half since OpenAI’s last model leap, GPT-4. Since then, the start-up that was allegedly conceived to ‘build AGI’ seems much more focused on product than on its supposed vision (this is precisely what Elon is arguing in his latest lawsuit against them).

In fact, according to some, they have only released worse models since the launch of GPT-4, which would be devastating for their brand if true.

That said, the Structured Outputs feature is great news for all of us, as the industry has spent all its hype bullets and investors are finally taking a ‘show me results’ stance.

But if we take a more cynical approach after seeing the latest delay of GPT-5, could it be that OpenAI is incapable of making a great leap with the next generation of models?

Let’s not forget that most (if not all) of the valuation premium given to AI companies is based on future expectations rather than current reality. How will markets react if GPT-5 underdelivers?

In the meantime, in very much OpenAI’s style, Sam Altman basically confirmed Project Strawberry (which I covered previously here) with a tweet showing a strawberry garden. They are hype masters, even though people aren’t buying it anymore.

PRODUCT OF THE WEEK
Napkin AI

I firmly believe AI will be a great productivity unlocker. Thus, I’ve decided to share useful AI products with you whenever I come across them, even if they don’t sponsor this newsletter.

Recently, I came across a product (again, not sponsored) that I’ve found extremely useful (I used it today and will be using it a lot from now on): it turns text into visuals and diagrams, even offering refinements if necessary. It’s not perfect, but it’s already pretty useful, which is quite the statement in Generative AI these days.

But don’t take my word for it. You can try it for yourself at this link for free.

THE LEARNING CORNER
Recent Insights

TREND OF THE WEEK
Grokking, a New Form of Reasoning

Here’s a bold statement for you: this week’s Trend of the Week will make you ‘AI smarter’ and, along the way, shed light on the harsh reality of frontier AI models: as you’ll see, they are actually pretty dumb.

You are probably tired of countless statements about how intelligent Large Language Models (LLMs) are, but it’s all overblown.

While they can be incredibly useful with techniques like the Augmented LLMs described in this newsletter, they aren’t the smarty-pants the news would have you believe.

But a technique called grokking might change that in the most spectacular way: by training small toy LLMs to be much smarter than frontier AI models that cost hundreds of millions of dollars to train.

But how is that possible?

Confusing Memorization with Reasoning

There are four ways in which an AI can reason.

Four Options, A Harsh Reality

  1. Implicit reasoning: When the model has built internal reasoning circuits that allow it to perform ‘intrinsic reasoning’ after seeing multiple examples during training. It’s similar to driving your car: you perform intelligent actions unconsciously because they come naturally to you.

  2. Verbalised reasoning: When we induce the model to explicitly verbalize the reasoning, like asking it to ‘solve the problem step-by-step.’ Making the model ‘explain its thoughts’ leads to better reasoning.

An interesting fact for you: Anthropic’s Claude models are prompted by the system to generate ‘thoughts’, parts of the response wrapped in <thought>-style tags that are hidden from the user but help the model respond better.

  3. Few-shot reasoning: When we provide reasoning chains as context, for instance by giving examples of ‘how to reason’ as part of the prompt. Given examples to imitate, that is precisely what the model will copy.

  4. Active search reasoning: When we allow the model to generate multiple possible solutions and verify them until it settles on the best one.

Today, most of our efforts revolve around the second and third options. However, in both cases, the results are pretty mediocre and, as we will see in a minute, outright inferior to the method we are seeing today.
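To make options two and three concrete, here’s a tiny illustration of how the same question can be framed either way; the prompts are my own and not taken from any specific paper:

# Illustrative prompts for 'verbalized' (chain-of-thought) and 'few-shot'
# reasoning. The wording is my own example, not from a paper.
question = ("Alice has 10 brothers and 10 sisters. "
            "How many sisters does any of the brothers have?")

# Option 2: verbalized reasoning -- ask the model to spell out its steps.
cot_prompt = f"{question}\n\nSolve the problem step by step before giving a final answer."

# Option 3: few-shot reasoning -- show a worked reasoning chain to imitate.
few_shot_prompt = (
    "Q: Tom has 3 brothers and 2 sisters. How many brothers does any of his sisters have?\n"
    "A: Each sister has Tom's 3 brothers plus Tom himself, so 4 brothers.\n\n"
    f"Q: {question}\nA:"
)

# Either string would then be sent to whichever LLM API you use.
print(cot_prompt)
print(few_shot_prompt)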

The fourth option, search-enhanced reasoning (i.e., active search), is also a promising path: a way of ‘giving models time to think.’

Long considered the holy grail of LLM research (I provide much more detail in a previous newsletter), the idea is that by allowing models to explore the space of possible solutions, they imitate the same procedure we humans follow when thinking through a problem, what psychologists call ‘System 2 thinking,’ or ‘conscious thinking.’

However, the sheer scale of compute and memory this paradigm requires keeps it, for now, far from reality.
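Still, the core idea is simple enough to sketch. Below, generate and score are hypothetical stand-ins for a sampling call and a verifier; this is a rough best-of-N illustration of active search, not any lab’s actual implementation:

# Rough best-of-N sketch of 'active search' reasoning. generate() and score()
# are hypothetical stand-ins for a sampling call and a verifier model.
from typing import Callable

def best_of_n(question: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    # Sample n candidate solutions and keep the one the verifier scores highest.
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: score(question, answer))

# Example with dummy stand-ins:
# answer = best_of_n("2+2?", generate=lambda q: "4", score=lambda q, a: 1.0)

The cost problem is visible right in the loop: every additional candidate is another full generation (plus a verification pass), which is exactly why this paradigm is so compute- and memory-hungry.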

So, with options two and three being very limited, and the fourth being too expensive, this leaves us with the first option: implicit reasoning.

Can AI implicitly reason?

While approaching System 2 through search undoubtedly needs to happen eventually, we still have yet to train good ‘intrinsic reasoners,’ models that are intrinsically smart or, more specifically, models that can effectively reason over their “known knowns.”

One way to differentiate the two is that System 1 performs intelligent, reasoned actions drawn from past experience and knowledge, while System 2 requires actively searching for an answer that is not natural or automatic to you.

And I have plenty of proof that current models can’t reason over their knowledge. In fact, you can check it for yourself.

For instance, one common reasoning type they consistently fail at is composition reasoning: the model is given two related facts and has to infer a third.

Take the following two facts:

  1. Barack Obama’s wife is Michelle.

  2. Michelle was born in 1964.

Given these, if we ask a model to infer when Barack’s wife was born (1964), only a few get it right. And if we turn it up a notch, things get pretty bad.

Here’s a prompt all LLMs today fail on the first try (and I mean all of them):

“Alice has 10 brothers and 10 sisters. How many sisters does any of the brothers have?” Again, this is a composition exercise. The model has two facts, and a third fact must be inferred based on those two.

Here’s the answer GPT-4o gave me:

You see what’s going on, right? The model can’t reach the conclusion that Alice must be counted as part of the sister group.

Next, let’s try to lure it toward the answer by asking it to take a step-by-step approach, aka ‘verbalized reasoning’ (the classic chain-of-thought method):

Notice the pattern? The model still has not figured out that Alice is also one of the sisters, despite that being obvious from the facts I gave it.
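For reference, the composition the models keep missing fits in three lines; the correct answer is 11:

# The step the models miss: Alice herself is one of the sisters.
sisters_besides_alice = 10
sisters_of_each_brother = sisters_besides_alice + 1  # +1 for Alice
print(sisters_of_each_brother)  # 11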

So, if they aren’t capable of such simple reasoning, we quickly realize that LLMs are just parroting things without actually understanding them.

Therefore, have we been continuously gaslighted by Tech tycoons into thinking LLMs are much smarter than they are?

Long story short, yes, but we could make LLMs truly smart with grokking.

Grokking, Becoming One with Data

Just like any other AI, LLMs are trained using a standard procedure.

A Decades-long Method

  1. We gather as much data as possible and divide it into two categories: training and testing.

  2. We train the model on the training set and then test it on ‘unseen’ testing data. If the model performs well on the test data, it means it has learned useful things from training that allow it to perform well on data ‘it has not seen before.’ This idea is called ‘generalization.’

Traditionally, in this procedure, the key thing to avoid is ‘overfitting,’ a phenomenon that occurs when we extend the training so much that the model basically memorizes the training data, making it incapable of generalizing to the test set.

By training on the data for too long, we lure the model into absurdly specific conclusions. For instance, if the model has only seen husky photos and overfits, it will only accept other huskies as ‘dogs.’

Thus, historically, researchers have aimed to find the sweet spot between underfitting and overfitting. Using the dog example again, we aim to find the moment when the model learns general dog attributes (four-legged, hairy, paws…) as characteristics of all dogs instead of assuming that all dogs look like huskies.
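If you want to see that sweet spot in numbers, here’s a small, self-contained toy of my own (plain NumPy polynomial fitting, nothing to do with LLMs): as model capacity grows, the training error keeps shrinking while the test error eventually turns back up, which is overfitting in action.

# Toy overfitting demo: fit polynomials of growing degree to noisy data and
# compare train vs. test error. Purely illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)   # noisy ground truth

x_train, y_train = x[:40], y[:40]            # training split
x_test, y_test = x[40:], y[40:]              # held-out test split

for degree in (1, 3, 9, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:.3f} | test MSE {test_mse:.3f}")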


But there’s a problem: if we only mildly expose models to the data to prevent overfitting, they seem to learn only the data’s ‘gestalt,’ a surface impression rather than a deep understanding.

And here’s where grokking comes in. It is one of the most counterintuitive ideas in AI today, but it is potentially ground-breaking.

The Art of Grokking

Grokking breaks completely with the previous method by extending training well beyond overfitting.

At first, this might seem like a crime, but it hides a magical secret: with enough grokking, the model will learn to generalize to depths far exceeding those of a non-grokked model.

As we can see in the image below, when we overfit a model (the moment when the red curve reaches 100% accuracy), the accuracy on the test set is very low.

However, if we keep training past that point, the model suddenly starts performing well on the test set and, in the case of comparison reasoning problems (right-hand chart), even learns to perform out-of-distribution comparisons (more on this in a second).

And this overextension of the training is what we call grokking. At first, this might not make sense. How does a model that has memorized the training data perform well on different data?

The shortest answer is that the model has become ‘one with the data.’ But what does that mean?

Giving Models Time to Generalize

Just like allowing LLMs to search for different possible ways of solving a problem before answering is akin to giving models ‘time to think,’ grokking can be seen as a way to give models ‘time to generalize.’

The issue with the traditional training method is that if we reach overfitting and stop training, the model has indeed memorized the training data. Thus, we simply assumed that overfitting was something to avoid.

But if we extend the training, akin to making a model ‘bang its head against data it already knows,’ something magical happens. Over time, the model learns simpler ways to reach the same conclusion.

In more technical terms, most AI training today involves regularization, which acts like Occam’s Razor: the simpler the solution to a problem (the fewer assumptions it needs), the better, so over time the model is pushed to keep only the key elements that determine the solution.

As with all things, what is going on in grokking is better understood with an example. Let’s say you train a model to identify cats in photos and, to do so, you provide it with:

  • 99 photos of dark-colored cats that are four-legged and hairy and have whiskers, slit-shaped eyes, and a tail,

  • and 1 photo of a hairless cat (like a Sphynx).

If the model overfits the data, it will just assume that all cats are dark-colored, four-legged, hairy, and have whiskers, slit-shaped eyes, and tails.

And the Sphynx? Since the model has memorized the training data, it will still label that particular photo as a cat, but with no better explanation than ‘just because.’

But if we continue training the model on this data for longer, it eventually finds a simpler answer to the question: ‘What is a cat?’

For example, it might conclude that being small and having whiskers, slit-shaped eyes, and a tail are sufficient conditions for labeling an animal as a cat, while being four-legged, hairy, or a specific dark color is redundant information.

As this solution is simpler yet sufficient, it has a higher chance of generalizing well. Indeed, despite not being hairy or dark in color, the Sphynx is also a cat under the simpler definition the model reached during grokking.

In a nutshell, by seeing the same data repeatedly, like a human reading a research paper multiple times (this is me with this paper, lol), the model eventually develops a more holistic, sound, and simple understanding of the data.

Thus, with grokking, models go from memorizing to truly understanding data.
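To make this concrete, here’s a minimal sketch of what a grokking-style run looks like in code. Note the assumptions: it uses the classic modular-addition toy task from the original grokking experiments rather than the reasoning dataset discussed here, and the tiny architecture and hyperparameters are illustrative (the original work used a small Transformer). The recipe is the point: a small training split, strong weight decay as the ‘Occam’s Razor’ pressure, and a run that continues long after training accuracy hits 100%.

# Grokking-style training sketch on modular addition (a + b) mod p.
# Illustrative assumptions throughout; the original experiments used a small
# Transformer, but the recipe (small split, weight decay, very long run past
# 100% train accuracy) is what matters.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p

# A small training split (~40%) makes memorization easy and generalization hard.
perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))
train_idx, test_idx = perm[:split], perm[split:]

class TinyNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(p, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, p))
    def forward(self, x):
        return self.mlp(self.embed(x).flatten(1))

model = TinyNet()
# Weight decay is the regularization pressure that slowly pushes the model
# from the memorized solution toward a simpler, generalizing one.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

# Key point: keep training long AFTER train accuracy reaches 100%.
for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:>6} | train acc {accuracy(train_idx):.2f} | test acc {accuracy(test_idx):.2f}")

If the run groks (hyperparameters matter a lot here), the log shows exactly the pattern described above: training accuracy saturates early while test accuracy stays low for a long stretch, then climbs sharply many thousands of steps later.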

As for the deeper technical reasons why grokking works, I will upload a more tech-oriented explanation to the Premium Notion site.

But does grokking ‘walk the talk’? Yes, it does.

In fact, a simple grokked GPT-2-level model (a model from 2019, prehistoric by today’s standards) completely obliterates GPT-4 Turbo and Gemini 1.5 Pro (even with both using retrieval-augmented generation to improve results) on the following task:

Given a collection of facts and relationships about many related people, answer questions such as: who’s younger, John or Rick?

While the minute grokked Transformer achieves perfect accuracy, frontier models fail miserably.
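To picture what that task looks like, here’s a toy reconstruction of my own (not the paper’s exact data format): the model is trained on atomic facts such as birth years plus a subset of comparisons, and is then tested on comparisons it never saw, which it can only answer by composing the atomic facts.

# Toy reconstruction of a comparison-reasoning dataset. The format is my own
# illustration, not the one used in the grokking paper.
import random

random.seed(0)
people = {f"person_{i}": random.randint(1950, 2000) for i in range(20)}

# Atomic facts the model sees during training.
atomic_facts = [f"{name} was born in {year}." for name, year in people.items()]

# Inferred (comparison) facts: only a subset is shown during training; the
# held-out remainder tests whether the model truly composes the atomic facts.
names = list(people)
comparisons = [
    f"Who is younger, {a} or {b}? {a if people[a] > people[b] else b}"
    for a in names for b in names
    if a != b and people[a] != people[b]
]
random.shuffle(comparisons)
train_comparisons, held_out_comparisons = comparisons[:100], comparisons[100:]
print(atomic_facts[0])
print(train_comparisons[0])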

In conclusion, the data around grokking is impossible to ignore, and I expect grokked models to become a key trend we will all hear a lot about soon.

TheWhiteBox’s take

Technology:

From a purely technological standpoint, this is a revolution—not in the architectures used, but in how we train models.

Seeing how a grokked GPT-2-level model (1.5 billion parameters) destroyed frontier models, up to a thousand times larger, on reasoning problems, I wouldn’t be surprised if many labs soon release grokked models in the 2-to-10-billion-parameter range.

Products:

Although distillation is the most common method for training small models today (teaching small models to imitate large ones), I would not be surprised if a purposely grokked product comes out before the end of the year.

Considering that Elon Musk’s model is called ‘Grok,’ the very term this technique takes its name from, could Elon be preparing a grand entrance?

Markets:

No. Markets are NOT pricing in grokked models.

Grokked models could be the ultimate compression factor, allowing the creation of powerful reasoners that are orders of magnitude smaller than frontier models and, thus, much cheaper to run.

If this paradigm materializes, many investors will question whether the insane $210 billion in yearly CapEx (projected from last quarter’s numbers) that Big Tech is spending on AI hardware, data center land, and equipment is worth it.

However, these companies could argue that training grokked models requires much longer training runs, so there’s still room for justification.

That said, one thing’s for sure: no investor in the world is asking this question right now, which means simply reading this newsletter could give you an edge.

THEWHITEBOX
Premium

If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]