Monkey LLMs, AI Drones, Grok-2, & The King's Return

THEWHITEBOX
TLDR;

  • 👏 New Premium Content on AI applications, Hyperscaler fallacy, & more

  • 📰 Hot news from Frontier AI labs, Amazing New Agents, and Google’s new smartphone AI features

  • 🏝️ Insights on AI Drone wars, rethinking our framing on AI, tech layoffs, and an interview with Google Deepmind’s CEO

  • 🆕 Trend of the Week: Large Language Monkeys that Surpass Frontier Models

NEWSREEL
LLMs, Agents, & War

This week, we have news from Google, xAI, OpenAI, Anthropic, and MultiOn.

PRODUCT
Google Presents New AI Features

Google presented a new set of AI features in their Google Pixel event this week.

Among the different features, we have call notes, which summarize your phone conversations, and Gemini Live, which allows a Gemini model to interact with many of your applications to take action on your behalf (similar to Siri).

Importantly, this does not require continuous back-and-forth between the apps (everything is executed under the hood, which is lovely).

TheWhiteBox’s take:

The model will fail many times (as it did in the live demo during the presentation), but this is expected. My main concern is that this Gemini version will obviously run in the cloud.

This is a stark contrast with Apple, which has at least tried to run some of its features on the smartphone itself. Quite honestly, I’m not sure I’m comfortable with my personal phone calls going to a Google server to be summarized.

That said, the use cases shown seem more complex than what Apple promised in June, so it makes sense that most of them can’t be executed by an on-device AI.

LLMs
xAI Launches Grok-2, New Frontier Model

xAI, the AI company founded by Elon Musk that recently raised $6 billion, has launched the second version of its large language model, Grok.

It’s on par with most frontier AI models, even beating GPT-4o and Claude 3.5 Sonnet in some tasks. They also released a smaller Grok-2-mini version.

They have also partnered with Black Forest Labs, a company founded by the creators of the original Stable Diffusion, to integrate FLUX.1, which might be considered the best image generator in the world right now.

TheWhiteBox’s take:

Considering how good the model is, the launch made less noise than you would have expected. It seems agents are sucking all the air out of the room from LLMs.

In my opinion, this is just a little bit of ‘showing off’ from xAI, trying to maintain momentum ahead of Grok-3, which should be released before the end of the year and will be trained on the world’s largest GPU cluster, an installation of 100,000 H100 GPUs.

If you’re wondering how large that data center is, it will require 140 MW of power (enough to power more than 100,000 homes) and may have cost around $4 billion to build ($2 billion for the GPUs and another $2 billion for land, other equipment, and labor).

LLMs
OpenAI Reclaims Throne

After weeks, or even months, out of the top spot, OpenAI has reclaimed first place as the best overall frontier AI model, according to LMSYS’ leaderboard, thanks to “GPT-4o-latest”.

Interestingly, this time they have released two updates instead of one: the ‘latest’ version, optimized for chat (this is the one that has claimed the first spot), and ‘gpt-4o-08-06’, intended more for API-style tasks like function calling and instruction following, as explained by an OpenAI employee.

Still, no giant leap in intelligence appears to be on the horizon anytime soon.

ENTERPRISE
Anthropic Presents Prompt Caching

Anthropic, OpenAI’s main competitor and creator of Claude, which many consider the best AI in the world, has announced prompt caching.

In simple terms, in situations where part of the prompt repeats itself over multiple interactions (like having a conversation with the chatbot), the system detects the repetition and caches that part of the prompt, avoiding recomputing the attention mechanism for that part.

TheWhiteBox’s take:

Transformers, the architecture behind models like Claude or ChatGPT, are very inefficient. The reason is that they don’t compress state.

In layman’s terms, when a Transformer processes a sequence, it stores every word in its cache. It doesn’t remember just some previous parts of the conversation; it stores all of them.

As you engage in a conversation with the chatbot, previous parts of the conversation must be fed back to the model to recall the previous things you said, which creates considerable redundancy.

With prompt caching, the model can store previous interactions in that conversation and avoid recomputation, significantly decreasing the response time for every new interaction as the number of computations decreases (and so do costs).
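To make the mechanics concrete, here is a toy sketch of prefix caching in Python. It illustrates the general idea only, not Anthropic’s actual implementation: the hash-keyed cache and the quadratic cost counter are invented stand-ins for the real KV cache and attention compute.

```python
import hashlib

# Toy prefix cache: stands in for a real KV cache of attention tensors.
kv_cache: dict[str, list[str]] = {}
flops_spent = 0  # crude stand-in for attention compute

def process(tokens: list[str]) -> list[str]:
    """Pretend KV computation: cost grows quadratically with length,
    mimicking the attention mechanism."""
    global flops_spent
    flops_spent += len(tokens) ** 2
    return tokens  # a real cache would store key/value tensors, not tokens

def generate(prefix: str, new_turn: str) -> None:
    """Process one conversation turn, reusing cached work for the prefix."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = process(prefix.split())  # first time: full cost
    process(new_turn.split())  # cache hit: only the new tokens are computed

system_prompt = "You are a helpful assistant. " * 50  # long, repeated prefix
generate(system_prompt, "What is prompt caching?")
after_first = flops_spent
generate(system_prompt, "And why does it cut costs?")
print(f"first turn: {after_first}, second turn: {flops_spent - after_first}")
```

Running it shows the second turn costing a tiny fraction of the first, which is exactly where the advertised latency and cost savings come from.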

DeepSeek first introduced this idea, but it’s nice that larger labs are adopting this method.

NEWS OF THE WEEK
MultiOn Launches Agent Q

In what I consider the most impressive release of the week, MultiOn has presented a new breakthrough in web navigation agents. These agents can perform complex planning and multi-step reasoning, allowing them to interact with multiple websites on command and even self-correct their actions.

Fascinatingly, for the first time, it beats humans on these types of tasks when equipped with online search, and it boosts the performance of LLMs tremendously, with Llama 3 70B, for example, going from 18.6% to 95% accuracy.

The link above includes a video demonstration.

TheWhiteBox’s take:

Crucially, we cover the method that enables such impressive behavior, search, in the article below. In other words, they allow models to explore different possible solution paths.

Additionally, MultiOn mentions the capacity to ‘self-heal,’ shown in the video: the model initially fails to act because OpenTable doesn’t show availability at the restaurant it wants to book, so it performs a Google search that surfaces a bookable slot, leading it back to OpenTable to finish the task.

Hype aside, a couple of things. As you’ll see in the article below, using LLMs as verifiers to review the model’s plans and ideas (here, MultiOn forces the LLM to self-critique its actions) has a performance plateau.

Also, if we factor in Apple’s recent ToolSandbox benchmark, which is much more prepared to evaluate an agent’s true multi-step behavior, chances are Agent Q still has essential gaps to close.

Seeing the explosion of AI agent research, is there a doubt that we are now officially in the agent era?

More on that on Sunday.

LEARN
The Insights Corner

🤖 Unreasonably effective AI with Demis Hassabis, CEO of Google DeepMind

TREND OF THE WEEK
Large Language Monkeys: Is The Best Model Always the Best Option?

Research teams from Google DeepMind, Stanford, and Oxford have presented compelling evidence that opting for the ‘most intelligent’ LLM by default is a tremendous error.

When used as monkeys, smaller LLMs can confidently surpass the might of frontier AI models. The research also offers some of the deepest insights yet on ‘long-inference’ models, insights that will make you doubt every intuition you’ve built around LLMs.

In a nutshell, this week’s ‘trend of the week’ will broaden your perspective on Large Language Models (LLMs) and might even force you to rethink your understanding—and strategy—regarding Generative AI.

Large Language Monkeys

As we’ve discussed many times, LLMs are still not that intelligent. Accordingly, academia is openly researching two ways to improve this.

The Levels of Intelligence

As we covered last week, LLMs are going to improve their intelligence in two ways:

  • Compression: By giving models ‘time to learn,’ they innately become better thinkers by finding patterns in data. This type of intelligence allows models to perform reasoning instinctively, quickly, and without second thoughts.

  • Search: By giving models ‘time to think,’ they explore possible solutions at runtime until they find the best one. This is a slow, deliberate, and ‘conscious’ way of solving a problem.

You may have realized these intelligence types are eerily similar to Systems 1 and 2 of thinking, the famous ‘thinking modes’ theory by the late Daniel Kahneman.

System 1 is fast and automatic; System 2 is slow, deliberate, and conscious.

Last week, we saw how grokking was one of the most exciting trends for improving compression and helping a model go from memorizing patterns to actually understanding them.

But how can we give models ‘time to think’?

The Power of Search

The infinite monkey theorem suggests that a monkey randomly pressing keys on a typewriter for an infinite duration will almost certainly produce any given text, including the complete works of William Shakespeare.

Similarly, if we gave an LLM infinite time to generate every possible solution to a problem, it would eventually find a correct one. This premise has led all AI labs to fiddle with the idea.

But what do they mean by that? In a nutshell, whenever we ask a search-enhanced LLM to answer a given problem, the model doesn’t immediately reply.

In fact, it enters a stage of deep thought in which it starts generating dozens, hundreds, thousands, or even millions (in the case of AlphaCode 2) of possible solutions to the problem.

But why do this? The intuition is simple.

Instead of hoping the model gets the answer correct on the first try, we maximize its number of tries to increase the likelihood that it will get it correct at least once, which is akin to giving the model more time to get it right.

In a way, this is much fairer to the model. Whenever humans deal with a complex problem, we are allowed to explore possible solutions, like trying different ways of solving a maths problem. Here, we are applying the same principle to LLMs.
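As a concrete sketch, repeated sampling amounts to asking for many independent completions at non-zero temperature. The snippet below uses OpenAI’s chat completions API purely for illustration; the model name, prompt, and sample count are placeholders, and any sampling-capable LLM API would do.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_candidates(problem: str, k: int = 100) -> list[str]:
    """Draw k independent solution attempts in a single API call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any sampling-capable model works
        messages=[{"role": "user", "content": problem}],
        n=k,                  # number of completions to sample
        temperature=0.8,      # non-zero temperature makes attempts diverse
    )
    return [choice.message.content for choice in response.choices]

candidates = sample_candidates("Write a Python function that reverses a list.")
```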

But how do we define which solution is best?

The Verifier Problem

Let’s say your model generates 1,000 possible solutions to a problem. How do we decide which one is best? Using verifiers.

A Large Language Monkey architecture

There are three types:

  • Automatic verifiers. In situations like code, you may have a set of unit tests to validate whether a code snippet solves a particular problem. In those situations, you can automatically verify whether a solution is correct or not (see the sketch after this list).

  • LLM verifiers. Inspired by the seminal work of OpenAI in 2021, we can train additional LLMs to act as verifiers of the ‘thoughts’ generated by the main LLM (or use the LLM itself).

  • A combination of both. In situations where automatic verification is not an option (for instance, when the model generates 100 possible summaries to a text, choosing the best one is subjective and thus cannot be automatically decided), you can use an LLM to take the output and transform its format to make it automatically verifiable.
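To ground the first type, here is a minimal sketch of an automatic verifier for code: run every sampled candidate against unit tests and keep the first one that passes. The candidates and tests are toy examples made up for illustration, and a production system would sandbox the execution rather than exec untrusted model output directly.

```python
def passes_tests(code: str, tests: str) -> bool:
    """Automatic verifier: execute a candidate against assert-based unit
    tests in a throwaway namespace (unsafe outside a sandbox)."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function
        exec(tests, namespace)  # failing asserts raise AssertionError
        return True
    except Exception:
        return False

def best_of_n(candidates: list[str], tests: str) -> str | None:
    """Return the first sampled candidate the verifier accepts."""
    for code in candidates:
        if passes_tests(code, tests):
            return code
    return None  # no sample solved the problem (a coverage miss)

# Toy candidates standing in for the model's sampled solutions:
candidates = [
    "def reverse(xs): return xs",        # wrong attempt
    "def reverse(xs): return xs[::-1]",  # correct attempt
]
tests = "assert reverse([1, 2, 3]) == [3, 2, 1]\nassert reverse([]) == []"
print(best_of_n(candidates, tests))  # prints the second candidate
```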

As for the third technique, the combination of both, Google used it to train AlphaProof, which we covered two weeks ago in this newsletter.

Simply put, the LLM (Gemini) took maths problems in natural language and reformatted them into formal statements that could be verified using the Lean programming language; the model was then trained on those verified statements.

Of course, combining the main LLM and the verifier can lead to more complex model architectures, in which the LLM uses the verifier to explore open-ended questions until it finds an answer.

One common method of enabling search is Monte Carlo Tree Search, or MCTS. Fascinatingly, we saw a product that uses precisely this idea just a few minutes ago: MultiOn’s Agent Q, as explained in the newsreel above.

But are the benefits of sampling multiple solutions that big of a deal? Researchers have explored this paradigm in detail to find some fascinating insights.

Seeing LLMs in a New Light

Probably the most evident discovery is that, in many cases, going for the ‘most intelligent’ model is the wrong decision.

The biggest isn’t always the best.

To prove this, they analyzed whether a smaller LLM that samples multiple solutions to a given problem would outperform a bigger and smarter LLM without multiple sampling.

And what did researchers find?

Impressively, when running DeepSeek-Coder-V2, a smaller language model, with multiple sampling, the model outperformed state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, achieving a new state-of-the-art 56% on SWE-Bench Lite (a benchmark that evaluates a model’s capacity to solve GitHub issues), whereas those two models, combined, achieved 43%.

This illustrates the power of multiple sampling, as the same model, given just one attempt per question, reached only 15.9%.

Importantly, this test was done for a fixed FLOPs budget. Floating-point operations (FLOPs) measure the total amount of compute a model consumes.

Thus, by fixing the budget, both setups consume the same amount of compute: the larger model is run only once, but it can be orders of magnitude larger.

Engineering insight: Although the researchers use a more complex formula to estimate FLOPs (page 6), applying the formula OpenAI came up with, FLOPs = 6*N per generated token, is common practice, where N refers to the total number of non-embedding parameters. The embedding parameters are negligible at large sizes, though, so you can simply use the total parameter count.

For instance, Llama 3.1 8B has an inference budget of 48 billion FLOPs per predicted token.
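The arithmetic is worth seeing once. This worked example follows the article’s 6*N convention (note that some practitioners use 2*N for a forward-pass-only estimate) and shows how many small-model samples a fixed budget buys, assuming equal generation lengths per sample.

```python
def flops_per_token(n_params: float) -> float:
    """Rule of thumb used above: ~6 FLOPs per parameter per generated token."""
    return 6 * n_params

small, large = 8e9, 70e9  # Llama 3.1 8B vs. 70B parameter counts

print(f"{flops_per_token(small):.1e}")  # 4.8e+10, the 48 billion figure above
# Under a fixed FLOPs budget, one token from the 70B model buys roughly
# this many tokens from the 8B model:
print(flops_per_token(large) / flops_per_token(small))  # 8.75
```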

Additionally, they proved that this same model, sampled just five times, solved more issues than GPT-4o and Claude 3.5 Sonnet in that same dataset while being at least three times cheaper.

Seeing this, users and enterprises alike should automatically reconsider every single GenAI implementation they have done, as it could very well be the case that they (you) are overpaying for their (your) models.

It’s important to note that, through fine-tuning, you can train small models to outperform large ones even when both get just one sample, as Predibase proved a while back by training 25 fine-tunes of Mistral 7B that all outperformed GPT-4.

But researchers didn’t stop there. The team then explored what could be considered the first try at visualizing inference scaling laws.

The Forgotten Unlocker

You have probably heard about ‘scaling laws’ that suggest that model size and increased total compute are great predictors of improved performance. In layperson’s terms, the larger the training run (larger model and larger total invested compute), the better the outcome.

While this seems to be true, the recent stagnation in ‘model intelligence’ is probably because we have been stuck at a compute budget in the tens of millions of ExaFLOPs for two years now (with Llama 4 being the first announced model that will officially move into the hundreds of millions of ExaFLOPs, according to Mark Zuckerberg).

However, nobody had tested whether this is also true for inference compute: can we predict better performance with increased inference budgets, i.e., by running models for longer at inference time?

Well, these researchers have proven the answer is yes.

Independently of the model used, every model improved its coverage (the fraction of test problems it could solve) as the number of samples per question increased.

Essentially, this means that whenever an LLM is faced with an issue to solve, increasing the inference computation by increasing the number of tries the model has to solve the problem should yield a net improvement.
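Coverage here is the familiar pass@k metric. Assuming the researchers follow the unbiased estimator popularized by OpenAI’s Codex paper, which is standard in this line of work, you can compute it from n attempts per problem, c of which were correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Codex paper): the probability that at
    least one of k samples, drawn from n attempts of which c were
    correct, solves the problem."""
    if n - c < k:
        return 1.0  # every possible size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem the model solves ~16% of the time per attempt:
for k in (1, 5, 100):
    print(k, round(pass_at_k(n=1000, c=160, k=k), 3))
# Coverage climbs toward 1.0 as k grows, mirroring the paper's curves.
```

Even a 16% per-attempt success rate turns into near-certain coverage after a few hundred samples, which is the whole point of the monkey approach.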

This may sound obvious, but it was not proven until this research came out. All things considered, it’s extremely encouraging to know that you now have a simple-to-implement way to improve the results of your deployment:

Give models more time to think.

TheWhiteBox’s take:

Technology:

While increased inference time was expected to improve results, this is the first time we have visually observed such improvements. Importantly, this research also clarified that, in many cases, you should prioritize multiple-sample implementations over using the best model available ‘just because.’

On the flip side, researchers point out a fatal flaw in current LLM-based verifiers: non-automatic verifiers (like LLMs) seem to plateau, meaning that while inference scaling laws may hold in general, we have yet to create truly robust LLM verifiers that can scale with confidence.

Agent Q, covered earlier, uses this precise method, which serves as a great reminder that in AI, nothing is ever as it seems. As we enter the AI Agent era, bear in mind that LLM verifiers are an unsolved problem, no matter what benchmarks and rehearsed demos tell you.

Products:

In my humble view, labs that are building smaller models, like FAIR (Meta), DeepSeek, or Mistral, should try to build multiple-sample models that work this way by design (you can build the agentic workflow around one-shot LLMs, but that’s an extra effort for the user/enterprise).

People don’t care if the model is dumber if the result is better (and cheaper) than a SOTA model.

So, what are they waiting for?

Markets:

With model intelligence stagnating and no signs of a GPT-5-level model anytime soon, long-inference models could be the signal investors are desperately looking for to continue trusting the huge CapEx investments Big Tech is making in Generative AI.

In my opinion, acknowledging the limitations of the current frontier but showing a way to improve results at scale would be a no-hype narrative that would feel honest and, importantly, more than enough to keep the money flowing into the AI industry.

THEWHITEBOX
Closing Thoughts 🧐

With the ‘LLM card’ about as worn out as it gets, the industry is swiftly transitioning into agents to continue fueling the AI frenzy.

Fear not, as over the next weeks and months, incumbents will try to bamboozle you into believing that AI Agents are finally/irremediably/unapologetically here as an unstoppable force of nature that ‘will change the world.’

But are they really ready? This Sunday, we will dive into the era of AI agents to separate truth from hype.

Until next time!

What’s coming next? Separating truth from hype as we enter the Agent era.

For business inquiries, reach out to me at [email protected]