OpenAI's Strawberry & Google's Million-Expert LLM
THEWHITEBOX
TLDR;
News from the big tech crash, Andrej Karpathy's new venture, OpenAI, and Salesforce
The Resource Corner: some of what I've been reading and watching over the last week
OpenAI's Project Strawberry
Google's Million-Expert LLM
NEWSREEL
Week's Update
Over the past few days, the "Magnificent Seven", the seven largest big tech companies, saw $1.1 trillion in market value erode, roughly the Netherlands' entire GDP (2022), with $500 billion lost on Wednesday alone, amid fears that Joe Biden would soon impose tighter restrictions on chip exports to China, an escalation of the ongoing chip war between the US and China.
The fall also dragged down Asian stocks, the Dutch firm ASML, and essentially every company tied to the chip supply chain.
In the case of the hyperscalers, the big tech companies that provide cloud services, the fall was particularly steep amid concerns that the US government would soon bar them from providing GPU services to Chinese companies, a loophole that currently softens the impact of the export sanctions.
Also, the overall sentiment toward AI is shifting, especially considering Goldman Sachs' report listed below.
Moving on, Salesforce has announced a new AI Service Agent for its platform, built on its Einstein Copilot. It can autonomously handle customer service tasks and integrate with company records for seamless support.
The main takeaway for me is what I outlined in my winners & losers article; most SaaS products will evolve into declarative apps where you leverage an AI agent to perform the actions. I predict some of these products will even end up with no interface besides chat.
Andrej Karpathy, legendary researcher, has announced the creation of Eureka Labs, an AI-native school. Andrej is known for being extremely educational about AI, and this company could personify that mission.
Can't recommend his Zero-to-Hero Neural Networks series enough (it's free).
As for our weekly touchpoint on OpenAI, anonymous ex-OpenAI whistleblowers have accused OpenAI of using NDAs that illegally restrict employees from reporting issues to regulators.
On the flip side, they have released exciting research using games to train LLMs to be more transparent and explanatory.
Lately, OpenAI has been immersed in a bad PR flywheel due to its excessive secrecy (ironic, considering it is called OpenAI).
As for the research, it's becoming clear that the future of OpenAI models screams "verifiers." That is, using smaller models in real time to verify or validate the big model's "thoughts" and allow it to search for solutions.
Care to understand more? Read the article below on Project Strawberry.
LEARN
The Resource Corner
GenAI: Too Much Spend, Too Little Benefit, by Goldman Sachs
AI Optimism vs AI Arms Race, by Sequoia Capital
Misha Laskin on the Future of AI Agents, from the Training Data Podcast
Microsoft's Spreadsheet LLM, a new model specifically designed to work with spreadsheets, and Excel's future?
Extremely insightful philosophical discussion on LLMs, by ML Street Talk with Murray Shanahan (Google)
OPENAI
Project Strawberry
In an exclusive, Reuters claims to have obtained leaked internal OpenAI documents that reveal the company's plans.
They talk about a "new leap in reasoning" that allows a new type of model to become extremely performant at tasks like math, reaching 90% on benchmarks where current LLMs score little better than pure chance.
Thus, understanding Strawberry not only provides a clear view of what's to come but also gives insight into what might be the real moat in AI: reasoning, something OpenAI sorely needs now that current LLMs are heavily commoditized and no longer offer a durable edge.
An Evolving Mystery
Back in November last year, Sam Altman was fired.
The reasons were never made clear, but many rumored that Ilya Sutskever, one of OpenAI's cofounders, had "seen something" that spooked him so much that he felt the need to fire the CEO, who is notorious for moving very fast and potentially cutting corners to win.
That "something", known originally as Q*, was a new model that combined a pre-trained model (probably a GPT-style model) with a search algorithm.
These new models, openly researched by other labs like Google, are known as "long-inference models" (or "Long Horizon Task" models, as defined in the article).
We already covered them in last Monday's issue, but long story short: instead of simply providing the first answer they predict, they iterate over their own "thoughts" until they find the best one.
Breaking the Hype
Although iterative processes do increase the overall performance of the model (as we discussed), it's uncertain whether they are the key to surpassing human capabilities, as the pre-trained backbone, the LLM, was still trained to imitate human data.
So, what are the key things to know about Strawberry?
A New Post-Training Standard
Insiders claim Strawberry is a new post-training method that considerably elevates these models' reasoning capabilities, and its primary role will be to perform "deep research."
But what do we mean by "post-training method"?
Current LLMs follow two training phases:
Pre-training: Here, the model learns how words follow each other. This way, it learns to predict the next one, indirectly increasing its knowledge about the world, giving you GPT-type models.
Post-training: Here, we tailor the model's behavior, teaching it to follow instructions and, very importantly, avoid potentially harmful answers (aka making it safer to use), giving you the actual product (i.e., ChatGPT).
It's important to note that all the knowledge the model absorbs from data is gathered in the pre-training phase. Therefore, during post-training, we simply fine-tune how it's supposed to behave.
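To make the distinction concrete, here is a minimal sketch of what each phase optimizes for. Everything in it (the model object, its methods, the datasets) is a hypothetical stand-in for illustration, not OpenAI's actual pipeline:

```python
def pretrain(model, web_corpus):
    """Phase 1: next-token prediction over raw text; this is where knowledge is absorbed."""
    for document in web_corpus:
        tokens = model.tokenize(document)
        for i in range(1, len(tokens)):
            # Nudge the weights so that tokens[:i] predicts tokens[i].
            model.learn_next_token(context=tokens[:i], target=tokens[i])


def post_train(model, instruction_data):
    """Phase 2: shape behavior (instruction following, safety) on curated prompt/response pairs."""
    for prompt, preferred_response in instruction_data:
        # Fine-tune toward the preferred behavior; no new world knowledge is added here.
        model.learn_preference(prompt, preferred_response)
```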
Knowing this, what could they possibly be referring to?
Based on their own research and the leaked data, they seem to have worked on this problem in two ways: enhancing the model's System 1 thinking and developing new System 2 thinking capabilities.
System 1 & 2 thinking refers to the late Daniel Kahneman's research on how humans think; the former is unconscious, fast, and instinctive, like driving a car, and the latter is conscious, slow, and deliberate, like solving a math problem.
But what does that have to do with LLMs? Well, System 1 is the default mode of LLMs, a no-second-thoughts response to a user's question.
And how do we enhance a model's System 1 reasoning? Simple: improve the data.
Augmenting Reasoning Data
To enhance a model's innate response, you need better data. Thus, they needed a dataset specifically conceived to improve reasoning capabilities. Then, using a process-supervised reward model (or PRM), like the one proposed in their "Let's Verify Step by Step" paper, they can train a model to perform complex reasoning by design (or by "pure instinct").
For instance, the image below, a training example from the dataset released alongside that paper, shows data that is far more focused on step-by-step problem-solving than what is typically found in open data.
And what is this thing called Process-supervised Reward Model?
A common method for tailoring the model's behavior during post-training is to use a reward model that rewards the LLM whenever it chooses the best response or punishes it when it chooses the worst.
While standard reward models (called Outcome-supervised Reward Models) usually examine only the final response to determine whether it's right or wrong, PRMs reward or punish every single step of the process.
As seen below, the reward model gives the "OK" to all steps but punishes the model for making a mistake in the last one:
Source: OpenAI
Thus, by combining step-by-step data and rewards that look into every single step, we are indirectly incentivizing the model to memorize these reasoning thought processes, increasing the likelihood that, when facing a similar task, it executes the steps the same way.
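If it helps, here is a toy, self-contained contrast between the two reward types. The `toy_scorer` is a stand-in for a learned reward model (the real thing is a neural network, not a string check):

```python
solution_steps = [
    "Step 1: 12 apples split among 4 kids -> 12 / 4 = 3 apples each.",     # correct
    "Step 2: each kid eats 1 apple, so 3 - 1 = 2 apples remain per kid.",  # correct
    "Step 3: total remaining = 2 * 4 = 6 apples.",                         # wrong, should be 8
]

def toy_scorer(text):
    # Pretend the reward model has learned to spot the faulty arithmetic.
    return 0.0 if "= 6 apples" in text else 1.0

def outcome_reward(steps, scorer):
    """ORM: one score for the final outcome; the whole trace sinks or swims together."""
    return scorer(" ".join(steps))

def process_reward(steps, scorer):
    """PRM: one score per step, so the exact faulty step gets punished."""
    return [scorer(step) for step in steps]

print(outcome_reward(solution_steps, toy_scorer))  # 0.0 -> the entire answer is punished
print(process_reward(solution_steps, toy_scorer))  # [1.0, 1.0, 0.0] -> only step 3 is punished
```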
And how have they developed the model's System 2 capabilities?
Introducing Verifiers
On Monday, we mentioned that using verifiers can help long-inference models converge on an answer after exploring different ways to solve a problem.
This idea, casually introduced by OpenAI back in 2021 and reinforced in their paper released yesterday, might be a critical piece of Strawberry.
The idea is quite simple: the LLM generates possible thought paths, and the verifier (another, smaller LLM) validates every thought, helping search for the best possible solution.
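A minimal sketch of that loop, a plain best-of-N search with the verifier as the ranking function, might look like this. The `sample_thought` and `verify` functions are hypothetical stand-ins for the two models, and the real search is presumably far more sophisticated (e.g., branching over partial thoughts):

```python
import random

def sample_thought(prompt):
    # Stand-in for the big LLM proposing one candidate reasoning path.
    return f"candidate reasoning path #{random.randint(0, 999)} for: {prompt}"

def verify(thought):
    # Stand-in for the smaller verifier scoring how sound a path looks (0 to 1).
    return random.random()

def solve(prompt, n_paths=8):
    # Generate several thought paths, let the verifier rank them, and keep the best one.
    candidates = [sample_thought(prompt) for _ in range(n_paths)]
    return max(candidates, key=verify)

print(solve("What is 37 * 43?"))
```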
Consequently, based on OpenAI's research, Strawberry seems to be a model that improves reasoning considerably by being "given time to think," optimized along two directions:
improving the thought-quality baseline thanks to the step-by-step dataset, and
actively searching, in real time, for the best possible solution to a problem,
potentially inaugurating a new chapter in AI: the reasoning era.
TheWhiteBox's take
Technology:
This research, technologically speaking, is as cutting-edge as it gets, as we might be witnessing the new era of frontier AI. But, as always, if there's one thing that blocks technological development, it's economics.
It could very well be the case that, due to the expected insane compute costs of running an active-search model, they have settled for a production model that only focuses on the first step, enhancing System 1 thinking.
Products:
With Mira Murati acknowledging that we shouldn't expect GPT-5 this year, and the Reuters article clarifying that this is a post-training method, OpenAI will probably release Strawberry as a GPT-4.5-type model before deploying GPT-5 in 2025.
Markets:
Please remember that OpenAI has begun to overpromise and underdeliver, with constant delays and unsubstantiated claims. Therefore, although I can't hide my enthusiasm about this model, we need OpenAI to "walk the talk" before giving in to the hype.
That said, Strawberry-type models can become a huge moat for incumbents.
Not only is building the reasoning dataset already economically out of reach for most labs, but running these models at scale will shut out almost everyone except DeepMind and Anthropic, and only because they are backed by Google and Amazon, respectively.
Learn AI in 5 Minutes a Day
AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.
Our research team spends hundreds of hours a week summarizing the latest news, and finding you the best opportunities to save time and earn more using AI.
GOOGLE
The Million-Expert LLM
In what I predict will soon become a standard for LLM training, Google has achieved the coveted dream of many labs: extreme expert granularity.
This is thanks to PEER, a breakthrough that allows a Large Language Model (LLM) to be broken down into millions of experts at inference time, striking a balance between size and cost that might not only be economically irresistible but a matter of necessity.
Also, it sends a clear message from Google to the world.
Size At All Costs
As we've covered multiple times, LLMs crave size above all else. This is because these models develop new capabilities as the number of parameters increases.
Consequently, every new generation of these models is usually much larger than the one before, with GPT-4 being ten times the size of GPT-3 and GPT-5 rumored to be 30 times larger than GPT-4, if we go by the training-cluster size referenced by Kevin Scott, Microsoft's CTO.
However, you will probably have heard that these models are expensive to run and that the larger they are, the more expensive they become, especially during inference (when they are run).
But why?
Frontier AI models are memory-bound, meaning they saturate the memory of the GPUs before actually saturating the GPUs' processing capacity.
While state-of-the-art GPUs have immense processing power, their memory is fairly small. Consequently, these models must be spread across several GPUs, in the multiple dozens for larger models.
Thus, LLM inference clusters run at a heavy processing discount, rarely exceeding 40-50% of the global processing capacity. Moreover, you need to factor in GPU-GPU communication costs, which affect the overall latency of the model.
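To get a feel for the numbers, here is a quick back-of-the-envelope calculation. The figures (FP16 weights, 80 GB of memory per GPU) are assumptions for illustration; real deployments also need room for the KV cache and activations:

```python
params_billion = 176        # e.g., a Mixtral-8x22B-sized model
bytes_per_param = 2         # FP16 precision
gpu_memory_gb = 80          # one high-end accelerator

weights_gb = params_billion * 1e9 * bytes_per_param / 1e9   # 352 GB of weights alone
gpus_needed = -(-weights_gb // gpu_memory_gb)               # ceiling division -> 5 GPUs

print(f"Weights alone: {weights_gb:.0f} GB -> at least {gpus_needed:.0f} GPUs")
```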
Now, solving the memory problem isn't very tractable unless you make models smaller, apply quantization (lowering parameter precision, i.e., fewer decimals), or optimize the inference cache, all of which impact performance. However, there is a great way to improve throughput by reducing the computation required without affecting performance.
But how?
The Mixture-of-Experts Problem
A mixture-of-experts model (MoE) essentially breaks down a model into smaller networks, known as experts, that regionalize the input space.
In layman's terms, each expert becomes proficient in different input topics. Consequently, during inference, we route the input to the top-k experts for that topic, discarding the rest.
Technically speaking, since LLMs are a stack of Transformer blocks (image below, left), we introduce a new type of layer, the MoE layer, as a substitute for the standard MLP layers: identical in structure but fragmented into experts.
This is crucial, considering MLPs account for more than two-thirds of the model's overall parameter count and represent a disproportionate amount of the required compute, reaching 98% in some cases.
For example, Mixtral-8x22B is a 176-billion-parameter model divided into eight experts. Thus, if we activate only one expert per prediction instead of running the entire model, only 22 billion parameters activate.
A MoE Layer
This decreases the model's compute requirements by a factor of roughly eight, although the calculation is not exact, as the model is divided only at the feedforward layers, with the attention mechanism remaining untouched.
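For the curious, a stripped-down MoE layer looks roughly like the sketch below (PyTorch, with illustrative sizes and top-2 routing; this is the general pattern, not Mixtral's actual implementation). With eight experts and top-2 routing, only a quarter of the MLP weights are touched per token, which is where the compute savings come from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Simplified MoE layer: a router picks the top-k expert MLPs for each token."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)        # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)    # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():                 # run each selected expert once
                mask = idx[:, k] == int(e)
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(10, 64)   # 10 tokens with a hidden size of 64
print(layer(tokens).shape)     # torch.Size([10, 64])
```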
And what about latency?
This is much harder to estimate, as we have to factor in certain distributed-inference assumptions, such as whether the model is spread via expert, data, tensor, or pipeline parallelism (or a combination of them).
As a reference, for Mixtral-8x7B, Mistral claimed a 4x computational drop (two active experts per prediction) with a 6x latency decrease.
Excitingly, as Google proved back in 2022, dividing the network into experts seems to improve performance, too, which has led to an absolute frenzy of MoE models over the last few years (e.g., ChatGPT).
However, although higher granularity (more experts) is extremely desirable, it has remained an intractable problem, preventing labs from using more than 8-64 experts and seriously constraining our capacity to run larger models at scale.
Until now.
The Million-Expert Network
Google has created Parameter Efficient Expert Retrieval, or PEER, a new method for dividing the LLM into a million experts.
Specifically, the model performs cross-attention between the input and the experts as a retrieval method. But what does that mean?
At the core of an LLM is the attention mechanism, which allows words in the input sequence to talk to each other, share information, and update their meaning with the surrounding context.
For the sequence "I love my baseball bat", we use attention to update the meaning of "bat" with "baseball" so that "bat" now refers to a baseball bat, not an animal.
Here, however, we use attention to retrieve the most adept experts for any given input. The most intuitive way to think about this is the model using the input as a query and asking the experts: "Which of you are best suited to this input?"
The retrieved experts are then queried, and their information is embedded into the input.
For instance, if the input is "Michael Jordan", we might retrieve experts in basketball and Nike and embed the concepts of "basketball" and "Nike athlete" into the input to differentiate the player from the Hollywood actor.
The PEER layer. Source: Google
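If you prefer to see the idea in code, here is a heavily simplified sketch of the retrieval step: a query built from each token scores every expert's key, and only the top-scoring tiny experts fire. The real PEER uses product-key retrieval to make the search over a million keys cheap, and all sizes here are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertRetrieval(nn.Module):
    """Toy PEER-style layer: retrieve the top-k single-neuron experts per token via key matching."""

    def __init__(self, d_model=64, n_experts=100_000, top_k=16):
        super().__init__()
        self.top_k = top_k
        self.query_proj = nn.Linear(d_model, d_model)
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model))        # one key per expert
        # Each expert is tiny: a down-projection vector and an up-projection vector.
        self.expert_down = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)
        self.expert_up = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)

    def forward(self, x):  # x: (tokens, d_model)
        q = self.query_proj(x)                           # "which experts fit this input?"
        scores = q @ self.expert_keys.T                  # (tokens, n_experts) similarity scores
        weights, idx = scores.topk(self.top_k, dim=-1)   # retrieve the top-k experts per token
        gate = F.softmax(weights, dim=-1)                # (tokens, top_k) mixing weights
        down = self.expert_down[idx]                     # (tokens, top_k, d_model)
        up = self.expert_up[idx]                         # (tokens, top_k, d_model)
        h = torch.einsum("td,tkd->tk", x, down)          # each retrieved expert's scalar activation
        return torch.einsum("tk,tk,tkd->td", gate, torch.relu(h), up)

layer = TinyExpertRetrieval()
tokens = torch.randn(4, 64)
print(layer(tokens).shape)   # torch.Size([4, 64])
```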
When compared to other, more traditional methods, this solution is not only much more compute-efficient but also lowers perplexity (the metric by which we evaluate a model's quality), arriving at a rara avis in the world of AI:
All things considered, these models offer better compute performance and are better models overall. With PEER, this balance can be taken to the extreme regarding cost efficiency.
TheWhiteBox's take:
Technology:
As we've covered today, Google has found a way to increase the granularity of experts up to a million, decreasing inference costs for these models massively.
They also mention that this method applies to LoRA adapters, a clear hint at Lamini.ai's recent MoME architecture, which we covered last week.
It's important to note that this development is orthogonal to other efficiency-enhancing implementations that aim to optimize memory usage. However, most of those haven't gotten the traction that PEER will get, considering that this method has no visible drawbacks, at least at single-digit-billion parameter scales.
Products:
Based on the paper's author list, it's quite possible that Gemini 1.5 Flash, the insanely fast yet highly performant Gemini model, is already running a PEER architecture, and that most open-source implementations will adopt this method, no questions asked.
Markets:
Google's future is intertwined with AI. It no longer lags behind OpenAI in technological development, and, something rarely mentioned, compute-wise it is head and shoulders above everyone else.
They also have the largest amount of available data, especially video, with the equivalent of 2 quadrillion words of video uploaded to YouTube daily, enough to train 143 LLaMA 3 models, and sufficient free cash flow to lead the next frontier of AI models (they just announced their first dividend ever and a $70 billion stock buyback; in other words, they are rich as hell).
And if that wasnât enough, their relevance in the greatest Internet product, search, is increasing.
But is all this enough for Google to survive its issues? We'll take a look at that question this Sunday.
THEWHITEBOX
Closing Thoughts
This week, we have OpenAI and Google at both ends; one is pushing the next generation, while the other focuses on making our models more efficient.
However, the highlight of the week is the markets, dragged down by bearish sentiment toward the chip supply chain and the realization that AI is not generating the predicted demand, meaning the hundred-billion-dollar GPU investment (this year alone) is still looking to justify its cost.
Until next time!
Give a Rating to Today's Newsletter
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]