Monkey LLMs, AI Drones, Grok-2, & The King's Return
THEWHITEBOX
TLDR;
New Premium Content on AI applications, the Hyperscaler fallacy, & more
Hot news from frontier AI labs, amazing new agents, and Google's new smartphone AI features
Insights on AI drone wars, rethinking our framing on AI, tech layoffs, and an interview with Google DeepMind's CEO
Trend of the Week: Large Language Monkeys that surpass frontier models
PREMIUM CONTENT
Other New Premium Content
PREMIUM CONTENT
Things You Have Missed By Not Being Premium…
NEWSREEL
LLMs, Agents, & War
This week, we have news from Google, xAI, OpenAI, Anthropic, and MultiOn.
PRODUCT
Google Presents New AI Features
Google presented a new set of AI features in their Google Pixel event this week.
Among the different features, we have call notes, which summarize your phone conversations, and Gemini Live, which allows a Gemini model to interact with many of your applications to take action on your behalf (similar to Siri).
Importantly, this does not require continuous back-and-forth between the apps (everything is executed under the hood, which is lovely).
TheWhiteBox's take:
The model will fail many times (as it did in the live demo during the presentation), but this is expected. My main concern is that this Gemini version will obviously run in the cloud.
This is a stark contrast with Apple, which has at least tried to have some features run on your smartphone. Quite honestly, I'm not sure I'm comfortable with my personal phone calls going to a Google server to be summarized.
That said, the use cases shown seem more complex than what Apple promised in June, so it makes sense that most of them can't be executed by an on-device AI.
LLMs
xAI Launches Grok-2, New Frontier Model
xAI, the AI company founded by Elon Musk that recently raised $6 billion, has launched the second version of its large language model, Grok.
It's on par with most frontier AI models, even beating GPT-4o and Claude 3.5 Sonnet in some tasks. They also released a smaller Grok-2-mini version.
They have also partnered with Black Forest Labs, a company founded by the creators of the original Stable Diffusion, to integrate FLUX.1, which might be considered the best image generator in the world right now.
TheWhiteBox's take:
Considering how good the model is, the announcement made far less noise than you would have expected. It seems agents are sucking all the air in the room away from LLMs.
In my opinion, this is just a bit of 'showing off' from xAI, trying to maintain momentum ahead of Grok-3, which should be released before the end of the year and will be trained on the world's largest GPU cluster, a 100,000 H100 GPU install base.
If you're wondering how large that data center is, it will require 140 MW of power (enough to power more than 100,000 homes) and may have cost around $4 billion to build ($2 billion to buy the GPUs, another $2 billion for land, other equipment, and labor).
LLMs
OpenAI Reclaims Throne
After weeks, or even months, away from the top, OpenAI has reclaimed the first spot as the best overall frontier AI, according to LMSYS' leaderboard, thanks to 'GPT-4o-latest'.
Interestingly, this time they have released two updates instead of one: the codenamed 'latest' is optimized for chat (this is the one that has claimed the first spot), while 'gpt-4o-08-06' is intended more for API-style tasks like function calling, instruction following, and so on, as explained by an OpenAI employee.
Still, no giant leap in intelligence appears to be on the horizon anytime soon.
ENTERPRISE
Anthropic Presents Prompt Caching
Anthropic, OpenAI's main competitor and creator of Claude, which many consider the best AI in the world, has announced prompt caching.
In simple terms, when part of the prompt repeats itself over multiple interactions (like an ongoing conversation with the chatbot), the system detects the repetition and caches that part of the prompt, avoiding recomputing the attention mechanism over it.
This can lead to savings of up to 90% in API costs and 80% in latency, a dream come true for enterprises.
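If you want to try it, here is a minimal sketch of how prompt caching is typically enabled with Anthropic's Python SDK: you mark the large, repeated part of the prompt (here, a hypothetical long system prompt) so it gets cached across calls. Treat the exact field names as assumptions and double-check Anthropic's docs, as the feature launched in beta.

```python
# Minimal sketch using Anthropic's Python SDK; verify field names against the official docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_CONTEXT = "..."  # hypothetical: a large document or instruction set reused on every call

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_CONTEXT,
            # Mark this block so its computed representation is cached across requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points of the document above."}],
)
print(response.content[0].text)
```

Note that only the prefix up to the cache marker benefits; anything after it is still computed normally on every request.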
TheWhiteBox's take:
Transformers, the architecture behind models like Claude or ChatGPT, are very inefficient. The reason is that they don't compress state.
In layman's terms, when a Transformer processes a sequence, it stores every token in its cache; it doesn't keep just some previous parts of the conversation, but all of them.
As you engage in a conversation with the chatbot, previous parts of the conversation must be fed back to the model to recall the previous things you said, which creates considerable redundancy.
With prompt caching, the model can store previous interactions in that conversation and avoid recomputation, significantly decreasing the response time for every new interaction as the number of computations decreases (and so do costs).
DeepSeek first introduced this idea, but it's nice that larger labs are adopting this method.
NEWS OF THE WEEK
MultiOn Launches Agent Q
In what I consider the most impressive release of the week, MultiOn has presented a new breakthrough in web-navigation agents. These agents can perform complex planning and multi-step reasoning, allowing them to interact with multiple websites on command and even self-correct their actions.
Fascinatingly, when equipped with online search, it beats humans on these types of tasks for the first time, and it boosts LLM performance tremendously, with Llama 3 70B, for example, going from 18.6% to 95% accuracy.
The link above includes a video demonstration.
TheWhiteBox's take:
Crucially, we cover the method that enables such impressive behavior, search, in the article below. In other words, it allows models to explore different possible solution paths.
Additionally, MultiOn mentions a capacity to 'self-heal,' shown in the video: the model initially fails because OpenTable doesn't show availability at the restaurant it wants to book, so it performs a Google search that points it to a bookable option and then returns to OpenTable to finish the task.
Hype aside, a couple of things. As you'll see in the article below, using LLMs as verifiers to review the model's plans and ideas (here, MultiOn forces the LLM to self-critique its actions) hits a performance plateau.
Also, if we factor in Apple's recent ToolSandbox benchmark, which is much better suited to evaluate an agent's true multi-step behavior, chances are Agent Q still has essential gaps to close.
Seeing the explosion of AI agent research, is there a doubt that we are now officially in the agent era?
More on that on Sunday.
LEARN
The Insights Corner
Who's winning the race to AI-powered drones, the US or China? by the Wall Street Journal
Unreasonably effective AI with Demis Hassabis, CEO of Google DeepMind
Fireworks CEO on how SLMs will democratize AI use cases, by Sequoia Capital
TREND OF THE WEEK
Large Language Monkeys: Is The Best Model Always the Best Option?
Research teams from Google DeepMind, Stanford, and Oxford have presented compelling evidence that opting for the 'most intelligent' LLM by default is a tremendous error.
When used as 'monkeys,' smaller LLMs can confidently surpass the might of frontier AI models. The research also offers some of the deepest insights yet on 'long-inference' models, insights that will make you doubt every intuition you've built around LLMs.
In a nutshell, this week's 'trend of the week' will broaden your perspective on Large Language Models (LLMs) and might even force you to rethink your understanding, and strategy, regarding Generative AI.
Large Language Monkeys
As we have discussed many times, LLMs are still not that intelligent. Accordingly, academia is openly researching two ways to improve this.
The Levels of Intelligence
As we covered last week, LLMs are going to improve their intelligence in two ways:
Compression: By giving models 'time to learn,' they innately become better thinkers by finding patterns in data. This type of intelligence allows models to perform reasoning instinctively, quickly, and without second thoughts.
Search: By giving models 'time to think,' they explore possible solutions at runtime until they find the best one. This is a slow, deliberate, and 'conscious' way of solving a problem.
You may have realized these intelligence types are eerily similar to Systems 1 and 2 of thinking, the famous 'thinking modes' theory by the late Daniel Kahneman.
System 1 is fast and automatic; System 2 is slow, deliberate, and conscious.
Last week, we saw how grokking was one of the most exciting trends for improving compression and helping a model go from memorizing patterns to actually understanding them.
But how can we give models 'time to think'?
The Power of Search
The infinite monkey theorem suggests that a monkey randomly pressing keys on a typewriter for an infinite duration will almost certainly produce any given text, including the complete works of William Shakespeare.
Similarly, if we gave an LLM infinite time to generate every possible solution to a problem, it would eventually find one. This premise has all the major AI labs fiddling with the idea.
But what do they mean by that? In a nutshell, whenever we ask a search-enhanced LLM to answer a given problem, the model doesn't immediately reply.
In fact, it enters a stage of deep thought, in which it starts generating dozens, hundreds, thousands, or even millions (in the case of Alphacode 2) of possible solutions to the problem.
But why do this? The intuition is simple.
Instead of hoping the model gets the answer correct on the first try, we maximize its number of tries to increase the likelihood that it will get it correct at least once, which is akin to giving the model more time to get it right.
In a way, this is much fairer to the model. Whenever humans deal with a complex problem, we are allowed to explore possible solutions, like trying different ways to solve a maths problem. Here, we are applying this principle to LLMs.
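To make that intuition concrete, here is a tiny sketch of my own (not from the paper): if a model solves a problem with probability p on any single try, the chance that at least one of k independent samples is correct grows very quickly with k.

```python
# Why repeated sampling helps: probability of at least one correct answer among k tries.
def at_least_one_correct(p: float, k: int) -> float:
    """P(at least one success) = 1 - (1 - p)^k for k independent samples."""
    return 1.0 - (1.0 - p) ** k

p = 0.16  # illustrative: a weak model that solves ~16% of problems in a single attempt
for k in (1, 5, 50, 250):
    print(f"k={k:>3}: {at_least_one_correct(p, k):.2%}")
# k=  1: 16.00%
# k=  5: 58.18%
# k= 50: 99.98%
# k=250: 100.00%
```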
But how do we define which solution is best?
The Verifier Problem
Let's say your model generates 1,000 possible solutions to a problem. How do we decide which one is best? Using verifiers.
A Large Language Monkey architecture
There are three types:
Automatic verifiers. In situations like code, you may have a set of unit tests to validate whether a code snippet solves a particular problem. In those situations, you can automatically verify whether a solution is correct or not.
LLM verifiers. Inspired by the seminal work of OpenAI in 2021, we can train additional LLMs to act as verifiers of the 'thoughts' generated by the main LLM (or use the LLM itself).
A combination of both. In situations where automatic verification is not an option (for instance, when the model generates 100 possible summaries of a text, choosing the best one is subjective and thus cannot be automatically decided), you can use an LLM to take the output and transform its format to make it automatically verifiable.
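To illustrate the first, fully automatic case, here is a minimal sketch (my own, with hypothetical `generate_candidate` and `run_unit_tests` helpers standing in for the LLM call and the test harness) of the sample-then-verify loop for code.

```python
# Sample-then-verify with an automatic verifier (unit tests); a minimal sketch.
from typing import Callable, Optional

def sample_and_verify(
    generate_candidate: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> code string
    run_unit_tests: Callable[[str], bool],     # automatic verifier: code string -> pass/fail
    prompt: str,
    k: int = 100,                              # sampling budget
) -> Optional[str]:
    """Draw up to k samples and return the first candidate that passes the tests."""
    for _ in range(k):
        candidate = generate_candidate(prompt)
        if run_unit_tests(candidate):
            return candidate  # verified solution found
    return None  # no sample passed within the budget
```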
As for the last of these, the combined technique, Google used it to train AlphaProof, which we covered two weeks ago in this newsletter.
Simply put, the LLM (Gemini) took maths problems in natural language and reformatted them into formal statements that could be verified using the Lean programming language, and the model was then trained on them.
Of course, combining the main LLM and the verifier can lead to more complex model architectures, in which the LLM uses the verifier to explore open-ended questions until it finds an answer.
One common method of enabling search is to use Monte Carlo Tree Search, or MCTS. Fascinatingly, we just saw a couple of minutes ago a product that uses precisely this idea: MultiOn's Agent Q, as explained in the newsreel above.
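For context on what MCTS looks like in this setting, below is a heavily simplified, generic skeleton; it is my own illustration, not MultiOn's implementation, and the `expand_actions`, `rollout_value`, and `is_terminal` hooks are hypothetical placeholders you would back with an LLM (to propose actions) and a verifier or critic (to score them).

```python
# Generic Monte Carlo Tree Search (MCTS) skeleton, simplified for illustration.
import math
import random
from typing import Callable, List, Optional

class Node:
    def __init__(self, state, parent: Optional["Node"] = None):
        self.state = state
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0  # running sum of rewards from evaluations below this node

    def uct(self, c: float = 1.4) -> float:
        # Upper Confidence bound for Trees: balance exploitation and exploration.
        if self.visits == 0:
            return float("inf")  # always try unvisited children first
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state,
         expand_actions: Callable,  # hypothetical: state -> list of next states (LLM proposals)
         rollout_value: Callable,   # hypothetical: state -> reward in [0, 1] (verifier / critic)
         is_terminal: Callable,     # hypothetical: state -> bool
         iterations: int = 100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: descend the tree greedily by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2) Expansion: let the policy (e.g., an LLM) propose follow-up states.
        if not is_terminal(node.state):
            node.children = [Node(s, parent=node) for s in expand_actions(node.state)]
            if node.children:
                node = random.choice(node.children)
        # 3) Evaluation: score the reached state with the verifier / critic.
        reward = rollout_value(node.state)
        # 4) Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Pick the most-visited first move as the final decision.
    return max(root.children, key=lambda n: n.visits).state if root.children else root_state
```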
But are the benefits of sampling multiple solutions that big of a deal? Researchers have explored this paradigm in detail to find some fascinating insights.
Seeing LLMs in a New Light
Probably the most evident discovery is that, in many cases, going for the 'most intelligent' model is the wrong decision.
The biggest isnât always the best.
To prove this, they analyzed whether a smaller LLM that samples multiple solutions to a given problem would outperform a bigger and smarter LLM without multiple sampling.
And what did researchers find?
Impressively, when running DeepSeek-Coder-V2, a smaller language model, with multiple sampling, the model outperformed state-of-the-art models like GPT-4o or Claude 3.5 Sonnet, achieving a new state-of-the-art 56% on SWE-Bench Lite (a benchmark that evaluates a model's capacity to solve GitHub issues), while these two models, combined, achieved 43%.
This illustrates the power of multiple sampling, as this same model reached just 15.9% when given only one attempt per question.
Importantly, this test was done with a fixed FLOPs budget. FLOPs (floating-point operations) measure how much compute a model consumes, here per prediction.
Thus, by fixing the budget, the amount of compute spent is equal in both cases: the larger model only gets run once, but it can be orders of magnitude larger.
Engineering insight: Although the researchers use a more complex formula to estimate FLOPs (page 6), applying the formula OpenAI came up with, FLOPs = 6*N, is common practice, where N refers to the total number of non-embedding parameters. Embedding parameters are negligible at large sizes, though, so you can simply use the total parameter count.
For instance, Llama 3.1 8B has an inference budget of roughly 48 billion FLOPs per prediction.
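As a quick sanity check of that rule of thumb (sticking to the 6*N approximation quoted above rather than the paper's more detailed formula), the arithmetic looks like this:

```python
# Back-of-the-envelope inference compute using the 6*N approximation mentioned above.
def flops_per_prediction(num_params: float) -> float:
    # N = parameter count; embedding parameters are ignored as negligible at scale.
    return 6 * num_params

llama_3_1_8b = 8e9  # ~8 billion parameters
print(f"{flops_per_prediction(llama_3_1_8b) / 1e9:.0f} billion FLOPs per prediction")  # 48 billion
```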
Additionally, they proved that this same model, sampled just five times, solved more issues than GPT-4o and Claude 3.5 Sonnet in that same dataset while being at least three times cheaper.
Seeing this, users and enterprises alike should automatically reconsider every single GenAI implementation they have done, as it could very well be the case that they (you) are overpaying for their (your) models.
It's important to note that, through fine-tuning, you can train small models to outperform large ones even when both get just one sample, as Predibase proved a while back by training 25 fine-tunings of Mistral 7B that all outperformed GPT-4.
But researchers didnât stop there. The team then explored what could be considered the first try at visualizing inference scaling laws.
The Forgotten Unlocker
You have probably heard about 'scaling laws,' which suggest that model size and increased total compute are great predictors of improved performance. In layperson's terms, the larger the training run (larger model and larger total invested compute), the better the outcome.
While this seems to be true, the recent stagnation in 'model intelligence' is probably because we have been stuck at a compute budget of tens of millions of ExaFLOPs for two years now (with Llama 4 being the first announced model that will officially move into the hundreds of millions of ExaFLOPs, according to Mark Zuckerberg).
However, nobody had tested whether the same holds for inference compute. Can we predict better performance with increased inference budgets, i.e., running models for longer at runtime?
Well, these researchers have proven the answer is yes.
As seen below, independently of the model used, coverage (the fraction of test questions the model could solve) improved as the number of samples per question increased.
Essentially, this means that whenever an LLM is faced with an issue to solve, increasing the inference computation by increasing the number of tries the model has to solve the problem should yield a net improvement.
This may sound obvious, but it was not proven until this research came out. All things considered, it's extremely encouraging to know that you now have a simple-to-implement way to improve the results of your deployment:
Give models more time to think.
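If you want to measure this on your own deployment, coverage is usually reported with the unbiased pass@k estimator popularized by OpenAI's Codex paper; below is the standard numpy implementation, applied to illustrative numbers (n samples per problem, c of which passed your verifier).

```python
# Unbiased pass@k estimator (Chen et al., 2021), the standard way to report coverage.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given that
    c out of n total samples for this problem passed the verifier."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: 250 samples drawn, 40 passed, budget of 10 samples.
print(round(pass_at_k(n=250, c=40, k=10), 3))  # ≈ 0.831
```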
TheWhiteBox's take:
Technology:
While increased inference time was expected to improve results, this is the first time we have visually observed such improvements. Importantly, this research also clarified that, in many cases, you should prioritize multiple-sample implementations over using the best model available 'just because.'
On the flip side, researchers point out a fatal flaw in current LLM-based verifiers. Non-automatic verifiers (like LLMs) seem to plateau, meaning that while inference scaling laws might be true in all cases, we have yet to create truly robust LLM verifiers that can confidently scale.
Agent Q, covered earlier, uses this precise method, which serves as a great reminder that in AI nothing is ever as it seems. As we enter the AI Agent era, bear in mind that LLM verifiers remain an unsolved problem, no matter what benchmarks and rehearsed demos tell you.
Products:
In my humble view, labs that are building smaller models, like FAIR (Meta), DeepSeek, or Mistral, should try to build multiple-sample models that work this way by design (you can build the agentic workflow around one-shot LLMs, but that's extra effort for the user/enterprise).
People don't care if the model is dumber as long as the result is better (and cheaper) than a SOTA model's.
So, what are they waiting for?
Markets:
With model intelligence stagnating and no signs of seeing a GPT-5-level model anytime soon, long-inference models could be the signal investors are desperately looking for to continue to trust the huge CapEx investments that Big Tech is doing in Generative AI.
In my opinion, acknowledging the limitations of the current frontier but showing a way to improve results at scale would be a no-hype narrative that would feel honest and, importantly, more than enough to keep the money flowing into the AI industry.
THEWHITEBOX
Closing Thoughts
With the 'LLM card' played about as much as it can be, the industry is swiftly transitioning into agents to continue fueling the AI frenzy.
Fear not: over the next weeks and months, incumbents will try to bamboozle you into believing that AI Agents are finally/irremediably/unapologetically here as an unstoppable force of nature that 'will change the world.'
But are they really ready? This Sunday, we will dive into the era of AI agents to separate truth from hype.
Until next time!
What's coming next? Separating truth from hype as we enter the Agent era.
For business inquiries, reach out to me at [email protected]