
OpenAI's Strawberry & Google's Million-Expert LLM

THEWHITEBOX
TLDR;

  • 🗞️ News from the big tech crash, Andrej Karpathy’s new venture, OpenAI and Salesforce

  • 📚 The Resource Corner: Some of what I’ve been reading/seeing over the last week

  • 🍓 OpenAI's Project Strawberry

  • ✨ Google's Million-Expert LLM

NEWSREEL
Week’s Update 🚨

Over the past few days, the ‘Magnificent Seven’, the seven largest big tech companies, lost $1.1 trillion in market value, roughly the Netherlands’ entire 2022 GDP, including $500 billion on Wednesday alone. The sell-off came amid fears that the Biden administration will soon impose tighter restrictions on exports to China, an escalation of the ongoing chip war between the US and China.

The fall also hit Asian stocks, the Dutch ASML, and essentially any company tied to the chip supply chain.

In the case of the hyperscalers, the big tech companies providing cloud services, the fall was particularly steep amid concerns that the US Government will soon bar them from providing GPU services to Chinese companies, which would otherwise blunt the impact of the export sanctions.

Also, the overall sentiment toward AI is shifting, especially considering Goldman Sachs’ report listed below.

Moving on, Salesforce has announced a new AI Service Agent for its platform, using their Einstein Copilot as the foundation. It can autonomously handle customer service tasks and integrate with company records for seamless support.

The main takeaway for me is what I outlined in my winners & losers article; most SaaS products will evolve into declarative apps where you leverage an AI agent to perform the actions. I predict some of these products will even end up with no interface besides chat.

Andrej Karpathy, legendary researcher, has announced the creation of Eureka Labs, an AI-native school. Andrej is known for being extremely educational about AI, and this company could personify that mission.

Can’t recommend his Neural Networks: Zero to Hero series enough (it’s free).

As for our weekly touchpoint on OpenAI, anonymous ex-OpenAI whistleblowers have accused OpenAI of using NDAs that illegally restrict employees from reporting issues to regulators.

On the flip side, they have released exciting research using games to train LLMs to be more transparent and explanatory.

Lately, OpenAI has been immersed in a bad PR flywheel due to its excessive secrecy (ironic, considering it is called OpenAI).

As for the research, it’s becoming clear that the future of OpenAI models screams ‘verifiers.’ That is, using smaller models in real time to verify or validate the big model's ‘thoughts’ and allow it to search for solutions.

Care to understand more? Read the article below on Project Strawberry.

LEARN
The Resource Corner 📚

😱 GenAI: Too Much Spend, Too Little Benefit, by Goldman Sachs

🌳 AI Optimism vs AI Arms Race, by Sequoia Capital

🫡 Misha Laskin on the Future of AI Agents, from the Training Data Podcast

🤩 Microsoft’s Spreadsheet LLM, a new model specifically designed to work with spreadsheets, and Excel’s Future?

🧠 Extremely insightful philosophical discussion on LLMs, by ML Street Talk with Murray Shanahan (Google)

OPENAI
Project Strawberry 🍓

In an exclusive, Reuters claims to have obtained leaked internal OpenAI documents that reveal the company’s plans.

They talk about a ‘new leap in reasoning’ that allows a new model type to be extremely performant at tasks like math, reaching 90% on benchmarks where current LLMs perform barely better than pure chance.

Thus, understanding Strawberry not only provides a clear view of what’s to come but also gives insight into what might be the real moat in AI: reasoning, something OpenAI direly needs now that heavily commoditized LLMs no longer provide a moat on their own.

An Evolving Mystery

Back in November last year, Sam Altman was fired.

The reasons were never made clear, but many rumored that Ilya Sutskever, one of OpenAI’s cofounders, had ‘seen something’ that spooked him so much that he felt the need to fire the CEO, notorious for moving very fast and potentially cutting corners to win.

That ‘something’, known originally as Q*, was a new model that combined a pre-trained model (probably a GPT-style model) with a search algorithm.

These new models, openly researched by other labs like Google, are known as ‘long-inference models’ (or ‘Long Horizon Task’ models as defined in the article).

They were already covered in last Monday’s newsletter issue, but long story short: instead of simply providing the first answer they predict, these models iterate over their own ‘thoughts’ until they find the best one.

Breaking the Hype

Although iterative processes do increase the overall performance of the model (as we discussed), it’s uncertain whether they are the key to surpassing human capabilities, as the pre-trained backbone, the LLM, was still trained to imitate human data.

So, what are the key things to know about Strawberry?

A New Post-Training Standard

Insiders claim Strawberry is a new post-training method that considerably elevates these models' reasoning capabilities, and its primary role will be to perform ‘deep research.’

But what do we mean by ‘post-training method’?

Current LLMs follow two training phases:

  • Pre-training: Here, the model learns how words follow each other. This way, it learns to predict the next one, indirectly increasing its knowledge about the world, giving you GPT-type models.

  • Post-training: Here, we tailor the model’s behavior, teaching it to follow instructions and, very importantly, avoid potentially harmful answers (aka making it safer to use), giving you the actual product (i.e. ChatGPT).

It’s important to note that all the knowledge the model absorbs from data is gathered in the pre-training phase. Therefore, during post-training, we simply fine-tune how it’s supposed to behave.
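To make the split concrete, here’s a minimal sketch of the two phases using a toy model; the architecture, data, and hyperparameters are placeholders for illustration, not anything OpenAI actually uses:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the two training phases (hypothetical tiny model, not OpenAI's).
# Pre-training: learn next-token prediction on raw text (where knowledge is absorbed).
# Post-training: fine-tune the same weights on curated data to shape behavior.

vocab, dim = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, dim),
    torch.nn.Linear(dim, vocab),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

def next_token_loss(tokens):
    """Cross-entropy between the prediction at position t and the token at t+1."""
    logits = model(tokens[:, :-1])          # predict from all but the last token
    targets = tokens[:, 1:]                 # targets are the sequence shifted by one
    return F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Phase 1: pre-training on (synthetic) raw text.
raw_text = torch.randint(0, vocab, (8, 32))
next_token_loss(raw_text).backward(); opt.step(); opt.zero_grad()

# Phase 2: post-training on instruction data: same loss, curated data, different goal.
instruction_pairs = torch.randint(0, vocab, (8, 32))   # stands in for tokenized (prompt, answer) pairs
next_token_loss(instruction_pairs).backward(); opt.step(); opt.zero_grad()
```

In practice, the second phase also includes reward-model-based steps such as RLHF, which is exactly where the process-supervised reward model discussed below comes in.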

Knowing this, what could they possibly be referring to?

Based on their own research and the leaked data, they seem to have worked on this problem in two ways: Enhancing the model’s System 1 thinking and developing new System 2 thinking capabilities.

System 1 & 2 thinking refers to the late Daniel Kahneman’s research on how humans think; the former is unconscious, fast, and instinctive, like driving a car, and the latter is conscious, slow, and deliberate (solving a math problem).

But what does that have to do with LLMs? Well, System 1 is the default mode of LLMs, a no-second-thoughts response to a user’s question.

And how do we enhance a model’s System 1 reasoning? Simple: improve the data.

Augmenting Reasoning Data

To enhance a model’s innate response, you need better data. Thus, they needed a dataset specifically conceived to improve reasoning capabilities. Then, using a process-supervised reward model (PRM) like the one proposed in their ‘Let’s Verify Step by Step’ paper, they can train a model to perform complex reasoning by design (or by ‘pure instinct’).

For instance, the image below shows a training example from the dataset released with that paper; the data is far more focused on a step-by-step approach to problem-solving than what is typically seen in open datasets.

And what is this thing called Process-supervised Reward Model?

A common method for tailoring the model's behavior during post-training is to use a reward model that rewards the LLM whenever it chooses the best response or punishes it when it chooses the worst.

While standard reward models (called Outcome-supervised Reward Models) usually examine the response to determine whether it’s right or wrong, PRMs reward or punish every single step of the process.

As seen below, the reward model approves every step but penalizes the model for the mistake in the last one:

Source: OpenAI

Thus, by combining step-by-step data and rewards that look into every single step, we are indirectly incentivizing the model to memorize these reasoning thought processes, increasing the likelihood that, when facing a similar task, it executes the steps the same way.
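To make the ORM-versus-PRM distinction concrete, here’s a minimal sketch; the scoring functions are hypothetical stand-ins, not OpenAI’s actual reward models:

```python
# Sketch: outcome- vs process-supervised rewards over a chain of reasoning steps.
# `score_final_answer` and `score_step` stand in for learned reward models
# (hypothetical helpers, not OpenAI's actual ORM/PRM).

from typing import Callable, List

def orm_reward(steps: List[str], score_final_answer: Callable[[str], float]) -> float:
    # Outcome-supervised: only the final answer is judged.
    return score_final_answer(steps[-1])

def prm_reward(steps: List[str], score_step: Callable[[str], float]) -> float:
    # Process-supervised: every intermediate step gets its own reward,
    # so a single wrong step is penalized even if the final answer looks fine.
    step_scores = [score_step(s) for s in steps]
    return min(step_scores)   # one common aggregation: the chain is only as good as its weakest step

# Toy usage: pretend the third step contains the mistake.
fake_step_scorer = lambda s: 0.1 if "mistake" in s else 0.9
steps = ["define variables", "set up the equation", "mistake: drop a sign", "final answer: 42"]
print(prm_reward(steps, fake_step_scorer))   # 0.1 -- the flawed step drags the reward down
```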

And how have they developed the model’s System 2 capabilities?

Introducing Verifiers

On Monday, we mentioned that using verifiers can help long-inference models converge on an answer after exploring different ways to solve a problem.

This idea, casually introduced by OpenAI back in 2021 and reinforced in their paper released yesterday, might be a critical piece of Strawberry.

The idea is quite simple: the LLM generates possible thought paths, and the verifier (another, smaller LLM) validates every thought, helping search for the best possible solution.
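A minimal sketch of what that search loop could look like (a best-of-N flavor); the generator and verifier functions here are hypothetical placeholders, not OpenAI’s implementation:

```python
# Sketch of verifier-guided search: a generator proposes several candidate reasoning
# paths and a smaller verifier scores them; we keep the best-scoring one.
# `generate_paths` and `verifier_score` are hypothetical stand-ins for the two models.

from typing import Callable, List

def search_with_verifier(
    prompt: str,
    generate_paths: Callable[[str, int], List[str]],    # big LLM sampling n candidate solutions
    verifier_score: Callable[[str, str], float],        # small LLM scoring each candidate
    n: int = 16,
) -> str:
    candidates = generate_paths(prompt, n)
    return max(candidates, key=lambda path: verifier_score(prompt, path))

# Toy usage with dummy functions:
dummy_gen = lambda p, n: [f"candidate {i}" for i in range(n)]
dummy_verify = lambda p, c: int(c.split()[-1])   # toy scorer: pretend higher index = better
print(search_with_verifier("2+2?", dummy_gen, dummy_verify, n=4))   # -> "candidate 3"
```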

Consequently, based on OpenAI’s research, Strawberry seems to be a model that improves reasoning considerably by being ‘given time to think’, optimized along two directions:

  • improving the thought-quality baseline thanks to the step-by-step dataset, and

  • actively searching, in real time, for the best possible solution to a problem,

potentially inaugurating a new chapter in AI: the reasoning era.

TheWhiteBox’s take

Technology:

This research, technologically speaking, is as cutting-edge as they get, as we might witness the new era of frontier AI. But, as always, if there’s one thing that blocks technological development, that’s economics.

It could very well be the case that, due to the insane compute costs expected from running an active-search model, they settle for a production model that focuses only on the first step: enhancing System 1 thinking.

Products:

Mira Murati has acknowledged that we shouldn’t expect GPT-5 this year, and the Reuters article clarifies that Strawberry is a post-training method. This means OpenAI will probably release it as a GPT-4.5-style version before deploying GPT-5 in 2025.

Markets:

Please remember that OpenAI has begun to overpromise and underdeliver with constant delays and unsubstantiated claims. Therefore, although I can’t hide my enthusiasm about this model, we need OpenAI to ‘walk the talk’ before giving in to the hype.

That said, Strawberry-type models can become a huge moat for incumbents.

Not only is building the reasoning dataset already completely out of reach economically for most labs, but running these models at scale will prevent almost anyone from competing, except perhaps DeepMind and Anthropic, and only because they are backed by Google and Amazon respectively.

Learn AI in 5 Minutes a Day

AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.

Our research team spends hundreds of hours a week summarizing the latest news, and finding you the best opportunities to save time and earn more using AI.

GOOGLE
The Million-Expert LLM ✨

In what I predict will soon become a standard for LLM training, Google has achieved the coveted dream of many labs: extreme expert granularity.

This is thanks to PEER, a breakthrough that allows a Large Language Model (LLM) to be broken down into millions of experts at inference time, striking a balance between size and cost that might not only be economically irresistible but a matter of necessity.

Also, it sends a clear message from Google to the world.

Size At All Costs

As we’ve covered multiple times, LLMs crave size above all else. This is because these models develop new capabilities as the number of parameters increases.

Consequently, every new generation of these models is usually much larger than the one before: GPT-4 is ten times the size of GPT-3, and GPT-5 is rumored to be 30 times larger than GPT-4, if we use Microsoft CTO Kevin Scott’s reference to the size of the training cluster.

However, you will probably have heard that these models are expensive to run and that the larger they are, the more expensive they become, especially during inference (when they are run).

But why?

Frontier AI models are memory-bound, meaning they saturate the memory of the GPUs before actually saturating the GPU’s processing capacity.

While state-of-the-art GPUs have immense processing power, their memory is fairly small. Consequently, these models must be spread across several GPUs, in the multiple dozens for larger models.

Thus, LLM inference clusters run at a heavy processing discount, rarely exceeding 40-50% of the global processing capacity. Moreover, you need to factor in GPU-GPU communication costs, which affect the overall latency of the model.
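A quick back-of-the-envelope calculation makes the point; the model size and GPU memory figures below are illustrative assumptions, not the specs of any particular deployment:

```python
# Back-of-the-envelope: how many GPUs are needed just to hold the weights of a large
# model at 16-bit precision, ignoring the KV cache and activations (which make it worse).
# All figures are illustrative assumptions.

params_billion = 1_000            # assume a ~1-trillion-parameter frontier model
bytes_per_param = 2               # fp16/bf16 precision
gpu_memory_gb = 80                # e.g. an 80 GB accelerator

weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
gpus_for_weights = -(-weights_gb // gpu_memory_gb)     # ceiling division

print(f"Weights alone: {weights_gb:.0f} GB -> at least {gpus_for_weights:.0f} GPUs")
# -> ~2000 GB of weights, i.e. 25+ GPUs before serving a single request.
```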

Now, solving the memory problem isn’t very tractable unless you make models smaller, perform quantization (decrease parameter precision, aka fewer decimals), or optimize the inference cache, all of which impact performance. There is, however, a great way to improve throughput: reduce the computation required without affecting performance.

But how?

The Mixture-of-Experts Problem

A mixture-of-experts model (MoE) essentially breaks down a model into smaller networks, known as experts, that regionalize the input space.

In layman’s terms, each expert becomes proficient in different input topics. Consequently, during inference, we route the input to the top-k experts for that topic, discarding the rest.

Technically speaking, knowing that LLMs are a concatenation of Transformer blocks (image below, left), we introduce a new type of layer, an MoE layer, in place of the standard MLP layers: identical in function but fragmented into experts.

This is crucial, considering MLPs account for more than two-thirds of the model's overall parameter count and represent a disproportionate amount of the required compute, reaching 98% in some cases.

For example, Mixtral-8×22B is a 176-billion-parameter model divided into eight experts. Thus, if we activate only one expert per prediction instead of running the entire model, only 22 billion parameters activate.

A MoE Layer

This decreases the model’s compute requirements roughly by a factor of eight, although the calculation is not exact: only the feedforward layers are divided, with the attention mechanism remaining untouched.
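For intuition, here’s a minimal sketch of a top-k MoE layer standing in for a Transformer’s MLP block; the expert count, routing, and sizes are illustrative, and real implementations add load balancing and batched expert dispatch:

```python
import torch
import torch.nn as nn

# Minimal sketch of a top-k MoE layer replacing a Transformer's MLP block.
# Illustrative only: real systems add load balancing, capacity limits, and batched dispatch.

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)              # scores every expert for each token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                      # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1) # keep only the top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                             # naive per-token loop, for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(e)](x[t])
        return out

layer = MoELayer(dim=64)
print(layer(torch.randn(5, 64)).shape)   # torch.Size([5, 64]); only 2 of 8 experts ran per token
```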

And what about latency?

This is much harder to estimate, as we have to factor in certain distribution assumptions, like whether the model has been spread via expert, data, tensor, or pipeline parallelism (or a combination of them).

As a reference, for Mixtral-8×7B, Mistral claimed a 4x drop in computation (with 2 active experts per prediction) and a 6x latency decrease.

Excitingly, as Google proved back in 2022, dividing the network into experts seems to improve performance, too, which has led to an absolute frenzy of MoE models over the last few years (ChatGPT’s underlying models, for instance, are rumored to be MoEs).

However, although higher granularity (more experts) is extremely desirable, routing efficiently across many experts was an intractable problem, preventing labs from using more than 8-64 experts and seriously constraining our capacity to run larger models at scale.

Until now.

The Million-Expert Network

Google has created Parameter Efficient Expert Retrieval, or PEER, a new method for dividing the LLM into a million experts.

Specifically, the model performs cross-attention between the input and the experts as a retrieval method. But what does that mean?

At the core of an LLM is the attention mechanism, which allows words in the input sequence to talk to each other, share information, and update their meaning with the surrounding context.

For the sequence “I love my baseball bat“, we use attention to update the meaning of “bat” with “baseball” so that now “bat” refers to a baseball bat, not an animal.

Here, however, we use attention to retrieve the most adept experts for any given input. The most intuitive way to think about this is the model using the input as a query and asking the experts: “Which of you are best suited for this input?”

The retrieved experts are then queried, and their information is embedded into the input.

For instance, if the input is “Michael Jordan”, we might retrieve experts in basketball and Nike and embed the concepts of “basketball” and “Nike athlete” into the input to differentiate the player from the Hollywood actor.

The PEER layer. Source: Google
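To give a feel for the retrieval idea, here’s a simplified sketch: the input is projected to a query, compared against learned expert keys, and only the top-k tiny experts run. It is an approximation of the concept, not Google’s actual PEER, which uses product-key retrieval to scale to roughly a million single-neuron experts:

```python
import torch
import torch.nn as nn

# Simplified sketch of retrieval-style expert selection in the spirit of PEER.
# Each "expert" here is a single hidden unit (a down/up vector pair); the toy pool
# holds 10,000 experts instead of a million, and plain dot-product retrieval replaces
# Google's product-key scheme.

class RetrievalExperts(nn.Module):
    def __init__(self, dim: int, num_experts: int = 10_000, top_k: int = 16):
        super().__init__()
        self.top_k = top_k
        self.query_proj = nn.Linear(dim, dim)
        self.keys = nn.Parameter(torch.randn(num_experts, dim) / dim**0.5)   # one key per expert
        self.down = nn.Parameter(torch.randn(num_experts, dim) / dim**0.5)   # expert input vectors
        self.up = nn.Parameter(torch.randn(num_experts, dim) / dim**0.5)     # expert output vectors

    def forward(self, x):                                   # x: (tokens, dim)
        q = self.query_proj(x)                              # (tokens, dim)
        scores = q @ self.keys.T                            # similarity to every expert key
        weights, idx = scores.topk(self.top_k, dim=-1)      # retrieve top-k experts per token
        weights = weights.softmax(dim=-1)                   # (tokens, top_k)
        hidden = torch.einsum("td,tkd->tk", x, self.down[idx]).relu()   # tiny expert activations
        return torch.einsum("tk,tk,tkd->td", weights, hidden, self.up[idx])

layer = RetrievalExperts(dim=64)
print(layer(torch.randn(3, 64)).shape)   # torch.Size([3, 64]); only 16 of 10,000 experts ran per token
```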

When compared to other, more traditional methods, this solution isn’t only much more compute-efficient; it also drops perplexity (the metric by which we evaluate a model’s quality), arriving at a rara avis in the world of AI:

All things considered, these models offer better compute performance and are better models overall. With PEER, this balance can be taken to the extreme regarding cost efficiency.

TheWhiteBox’s take:

Technology:

As we’ve covered today, Google has found a way to increase the granularity of experts up to a million, decreasing inference costs for these models massively.

They also mention that this method applies to LoRA adapters, a clear hint at Lamini.ai’s recent MoME architecture, which we covered last week.

It’s important to note that this development is orthogonal to other efficiency-enhancing implementations that aim to optimize memory usage. However, most of those haven’t gotten the traction that PEER will get, considering that this method has no visible drawbacks, at least at single-digit-billion-parameter scale.

Products:

Based on the paper’s signatories, it’s quite possible that Gemini 1.5 Flash, the insanely fast yet highly performant Gemini model, is already running a PEER architecture, and that most open-source implementations will adopt this method, no questions asked.

Markets:

Google’s future is intertwined with AI. It is no longer lagging behind OpenAI in technological development, and, something rarely mentioned, compute-wise it is head and shoulders above everyone else.

They also have the largest amount of available data, especially video, with the equivalent of 2 quadrillion words of video data uploaded to YouTube daily (enough to train 143 Llama 3 models), and sufficient free cash flow to lead the next frontier of AI models (they just issued their first dividend ever alongside a $70 billion stock buyback; in other words, they are rich as hell).

And if that wasn’t enough, their relevance in the greatest Internet product, search, is increasing.

But is all this enough for Google to survive its issues? We’ll take a look at that question this Sunday.

THEWHITEBOX
Closing Thoughts 🧐

This week, we have OpenAI and Google at both ends; one is pushing the next generation, while the other focuses on making our models more efficient.

However, the highlight of the week is the markets, dragged down by bearish sentiment toward the chip supply chain and the realization that AI is not generating the predicted demand, meaning the hundred-billion-dollar GPU investment (this year alone) is still looking to justify its cost.

Until next time!

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]