

REASONING STATE-OF-THE-ART
OpenAI’s o1 (Strawberry). Here’s All You Need to Know

The wait is over. OpenAI has finally released o1-preview and o1-mini, its new state-of-the-art reasoning models, while o1, the most powerful model in the family, is still being refined as we speak.

OpenAI has decided to ditch the ‘GPT-x’ name altogether and use ‘o1’ to signify that this is a paradigm shift in AI comparable to the release of GPT-3 in 2020, a new way of understanding, interpreting and, crucially, prompting AI (yes, how you communicate with AI has now changed, too).

It’s a new era for AI, but not quite how some paint it.

While everyone is raving about how ‘AI is now PhD-level intelligent’ or ‘going to eat my job,’ among other hype-bloated claims, this article will give you insight into what these models are, what they are capable of, how to use them, and what has changed with their release.

In particular:

  • How were they trained?

  • How good are they really?

  • When and how to use them (sorry guys, AI has not become a one-stop shop for models; o1 models aren’t always the solution)

  • Their current limitations

  • Are you going to have to pay a lot more?

  • The general implications, including new inference scaling laws, new prompting techniques, and data flywheels, among others.

  • What are OpenAI’s moat and strategy heading into the upcoming monstrous multi-billion-dollar funding round? (Hint: it’s not the model; it’s something way more important.)

  • And, finally, the main takeaways and how you should approach this paradigm shift.

Reading this article will answer most of your questions, including some you may not have considered yet. Let’s dive in.

The Active Inference Era is Here

Imagine people judging your intelligence by how fast and immediate your answers to tough questions are, while also expecting them to be accurate. You aren’t given time to plan, consider possible solutions, reflect on your answers, iterate, or back-track when you choose the wrong path; just straight answers off the top of your head.

You would be considered dumb, right? Well, that’s precisely how GPT-4o works and how it’s—unfairly—judged. But o1 models change that.

A paradigm shift in thinking

In real life, harder problems need time to be solved. In fact, our brain has two ways to solve problems depending on complexity:

  • System 1 thinking is a fast, innate (even unconscious) thinking mode where answers come to you immediately, typically for simpler questions.

  • System 2 thinking is a slow, deliberate, conscious mode in which you actively engage your prefrontal cortex to tackle complex tasks like math, coding, or planning.

Thus, the best way to summarize ‘what o1 models are' is precisely this transition from System 1 AI thinkers (GPT-4o, Claude 3, and all other previously conceived models) into System 2 AI thinkers—at least a first iteration.

In short, a new type of AI model. And how does that take form? To answer that, we need to learn the bitter lesson.

Rich Sutton, Silicon Valley’s messiah

In my life as a student of the AI arts, no other piece of knowledge has been cited more across research, start-up missions, or venture-capital blogs than The Bitter Lesson, in which Rich Sutton argues that AI researchers always end up learning the same bitter lesson: no algorithmic or data breakthrough can compete with throwing more computation at learning and search.

But what does that mean?

It means every step-function improvement in AI boils down to throwing more compute at learning and search methods. This breaks intelligence down into two ‘form factors’ we have covered extensively in this newsletter, which can be simplified as:

  1. Knowledge Compression

  2. Active inference search

The former is embodied by a Large Language Model (LLM) that takes in a huge data corpus and, by finding patterns in the data, compresses (embeds or encodes) that knowledge into its weights.

To be clear, LLMs are data compressors: AI models that take the knowledge of the world, encode it into a “smaller” packet, and can regurgitate that knowledge back to you.

This introduces us to the grand scaling law that has fueled investment in AI over the last decade: training scaling laws, or ‘the larger the model and data set, aka the larger the training compute, the more intelligent the model is.’
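To make that concrete, the original scaling-law papers fit test loss as a smooth power law in training compute. The form below follows Kaplan et al. (2020); the exponent is approximate and shown only to illustrate the shape of the curve:

```latex
% Illustrative training scaling law (after Kaplan et al., 2020):
% test loss L falls as a power law in training compute C.
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```

In plain terms: every multiplicative increase in training compute buys a roughly constant drop in loss, which is why labs have kept scaling up their pre-training runs.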

In other words, with LLMs we are building the model’s ‘native intelligence,’ its System 1 thinking capabilities, which has been the go-to method for ‘increasing intelligence in AI’ over the past few years.

This is the definition of GPT-4o.

As for the latter, search refers to allowing these models to explore possible solution paths to a problem instead of forcing them to be right on the first try.

In other words, building on the native intelligence of the data compressor (the LLM), we can take open-ended questions that would take an effectively infinite amount of time to solve at random, and make them tractable by letting the model actively search for the solution: suggesting logical paths to explore and, hopefully, converging on a final answer.

Simply put, we take the model from step 1 and let it search for solutions by increasing the ‘amount of thinking’ it can dedicate to a problem.

Fascinatingly, and this is one of the highlights of this release (we’ll delve more into it later), it introduces us to the next scaling law that will fuel the next AI era: inference scaling laws, or ‘the longer my model thinks, the better it thinks,’ as explicitly stated by Noam Brown, one of the masterminds behind o1.
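To see why ‘thinking longer’ helps even without any extra training, here is a minimal, runnable sketch of one inference-time scaling trick, self-consistency voting. It is only a crude proxy for o1’s unpublished search procedure, and `noisy_model` is a made-up stand-in for sampling a single answer from an LLM:

```python
import random
from collections import Counter

def noisy_model(question: str, correct: int = 42) -> int:
    """Hypothetical single-shot 'model': right only 60% of the time."""
    if random.random() < 0.6:
        return correct
    return correct + random.choice([-2, -1, 1, 2])

def solve(question: str, n_samples: int) -> int:
    """Spend more inference compute: sample n_samples answers and majority-vote."""
    votes = Counter(noisy_model(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(solve("What is 6 x 7?", n_samples=1))    # often wrong
print(solve("What is 6 x 7?", n_samples=50))   # almost always 42
```

A single sample is wrong 40% of the time; fifty samples almost never are. That is the essence of the inference scaling law: the same base model gets reliably better simply because it is allowed to spend more compute per question.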

And that, my dear reader, is what o1 models are.

Visually, the image below shows how GPT-4o (left) and o1 (right) differ in their approach to solving a problem:

But how does it work? For that, OpenAI has actually spilled the beans quite a bit.

Introducing: Reasoning tokens

On Thursday, we discussed the Reflection 70B drama, a model that wasn’t what it seemed. While the model was a fraud, the technique they used was legit, and o1 models confirm that.

The simplest way to describe o1 models is that they iterate over their responses, reflecting on previous answers and refining them to provide a more sound solution. This introduces us to a new type of token: reasoning tokens.

Simply put, an o1 response will be a combination of two types of tokens: reasoning tokens, which the model uses to think, and output tokens, which are eventually delivered to you.

The reasoning tokens are purposely hidden from you. Even though you can see the thinking process of the LLM on your screen, that’s a summary from another model (more on that below).
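You can actually see this split in the API’s usage accounting. Here is a minimal sketch using the official `openai` Python SDK, assuming the usage fields OpenAI documented at the o1 launch (`completion_tokens_details.reasoning_tokens`); note that reasoning tokens are billed as output tokens even though they never appear in the message content:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
)

usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens
print("completion tokens (billed):", usage.completion_tokens)
print("hidden reasoning tokens:   ", reasoning)
print("visible output tokens:     ", usage.completion_tokens - reasoning)
```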

To better understand this, we can use OpenAI’s own representation of this behavior:

Source: OpenAI

As you can see, once the user’s input is received, the model generates a set of reasoning tokens.

Then, conditioned on the input and the reasoning tokens, it generates the output tokens, repeating the process over multiple turns.

But is this graph all there is to it? Obviously not; the model does this in a very unique fashion.

A Holistic Approach to Multi-step Reasoning

When I say the model literally thinks for longer, I mean it. Thus, the output of the first turn might not be the output you see, as the model acknowledges it needs to ‘think more’.

But how do they implement that?

Here’s where search comes in. The actual process most likely looks something like this:

  1. Planning: The model’s first output is a plan for how to approach the problem

  2. First search: The model looks into its inner knowledge (and, in the future, retrieves from external sources) to enrich its own context, generating multiple possible ways to approach the solution (second output)

  3. Exploration: It engages in a search over its possible solution paths, iterating and back-tracking when it reaches dead ends. At this step, the turns become much more complicated, and the exercise becomes one of tree search, highly akin to Monte Carlo Tree Search (MCTS), but not quite the same (more on that below).

  4. Convergence: Once it finds a solution it’s happy with, it outputs two things:

    1. The output conclusion (the immediate response you see)

    2. A summary of the thinking process

The ‘thinking summary’ and the actual response (below).
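To make those four steps more tangible, here is a toy, runnable analogy of the explore-and-converge loop: a best-first search in which ‘thoughts’ are partial operation sequences and a crude hand-written heuristic plays the role of the Q-function/critic. It illustrates search with scoring and back-tracking, not OpenAI’s actual procedure:

```python
import heapq

# Toy problem: find a sequence of operations turning `start` into `target`.
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}

def solve(start: int, target: int, max_expansions: int = 10_000):
    # Frontier of (score, value, path); lower score = more promising "thought".
    frontier = [(abs(target - start), start, [])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:                               # every path is a dead end
            break
        _, value, path = heapq.heappop(frontier)       # pursue the best-scored thought
        if value == target:                            # convergence: solution found
            return path
        for name, op in OPS.items():                   # expansion: propose new thoughts
            nxt = op(value)
            if nxt not in seen and -100 <= nxt <= 10 * target:
                seen.add(nxt)
                score = abs(target - nxt) + len(path)  # crude "how promising is this?"
                heapq.heappush(frontier, (score, nxt, path + [name]))
    return None

print(solve(start=4, target=101))  # prints one sequence of ops turning 4 into 101
```

In o1’s case, both the expansion (proposing thoughts) and the scoring would be done by neural networks rather than hand-written rules, but the skeleton of the loop is the same idea.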

Interestingly, by asking the model to act as a cat embodying John Carmack (quite literally), one X user got the model to explain its thought process (while not confirmed to be the real thing, it’s most likely very similar).

But how does it decide what thoughts are worth pursuing and which ones not? This leads us to the most beautiful part of the process.

AlphaZero, Prover-Verifier, and the AI critic

Based on previous research, to perform step three, the exploration part, the LLM is combined with an AlphaZero-type RL algorithm and an AI critic. In other words, we equip the logic machine, the LLM, with two things:

  1. A Reinforcement Learning algorithm that, for every newly generated thought, suggests new solution paths or deeper thoughts down one particular path

  2. An AI critic that supports the thought evaluation

So, once the user sends the problem, the model makes a plan. Then, at each step, it generates possible strategies, which are evaluated in two ways:

  • Like in AlphaGo/AlphaZero, we have a Q-function (another neural network) that ‘scores the thought’ on how likely it is to yield a positive outcome if we go down that path (e.g., suggesting flying pigs as a means of transport will get a low score if the task is to reach Singapore as fast as possible, while buying a ticket from your nearest airport to Singapore will get a higher score)

  • In parallel, an AI critic (probably the same LLM) reflects on its responses and assigns a score, too.

Then, the search heuristic, or method for exploring the solution space, is most likely very similar to the one used by AgentQ: a combination of both scores. Using an ‘alpha’ parameter, we can control how much each term (the Q-function’s score or the AI critic’s score) should influence the final score of a given ‘thought’.
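In code, that blended score is just a weighted average. The sketch below is illustrative; in the real system the two inputs would come from the value network and the LLM critic described above:

```python
def thought_score(q_value: float, critic_score: float, alpha: float = 0.5) -> float:
    """Blend the Q-function's estimate with the AI critic's self-evaluation.
    alpha=1 trusts only the Q-function; alpha=0 trusts only the critic."""
    return alpha * q_value + (1 - alpha) * critic_score

# A thought the value network likes (0.8) but the critic doubts (0.3),
# weighted 70/30 toward the Q-function:
print(round(thought_score(q_value=0.8, critic_score=0.3, alpha=0.7), 2))  # 0.65
```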

At the end of the day, scoring a thought is a forecasting exercise, an estimate of how likely my thought process is to take me to the correct answer. But if you’re well-versed in forecasting, you may ask: why not simply use Monte Carlo Tree Search?

MCTS rollouts are expensive and non-differentiable, making the RL process hard to scale. Therefore, I would bet a lot of money they aren’t running full simulations but are instead taking the approach DeepMind took with AlphaZero’s value network: having a neural network estimate the score without having to execute the actual ‘simulation into the future.’

Moving on, I discovered a golden nugget many seem not to have realized: as mentioned earlier, the summary you see after the thinking process finishes (previous chat image) is not the chain of reasoning itself but a summary provided by another model, most likely trained using OpenAI’s prover-verifier research, i.e., a model trained to output explanations in human-understandable language. OpenAI refuses to publish the raw chain of thought for competitive reasons.

Now that we have the overall picture of o1 models, we must answer the real questions: Is it really that good? When does it fail? When should I not use them? Do the prompting techniques used for GPT-4o still hold?

And, finally, what’s OpenAI’s moat, and what’s coming next?

The Model is Awesome… But Not Always

Overall, it’s undeniably the state of the art for reasoning problems. In fact, the gap with GPT-4o and similar models is, quite frankly, striking.

When compared on math, code, or PhD-level science questions, both models are not only a dramatic improvement over GPT-4o; they are basically human-level.

In coding, the improvements are even more dramatic, with o1 reaching the 93rd percentile on the Codeforces benchmark (better than 93% of human competitive programmers, the best of the best) when fine-tuned on competition data:

Impressively, o1-mini is actually better at coding than o1-preview despite being much smaller, as shown by both the Codeforces and HumanEval benchmarks:

But does this mean that AI is finally as good as PhD-level humans? Of course not. While impressive, it shows that AI can outperform humans on some challenging PhD-level tasks; as you’ll see below, it is not generally better than these experts.

Specifically, we’ll look at one benchmark that continues to put AI to shame, even o1 models.

A Simple Yet Effective Heuristic

As OpenAI itself acknowledges, the o1 results prove that this release was never meant to position o1 models as the one solution to every problem. Quite the contrary.

More importantly, how you interact with AI is drastically different from now on, which puts another nail in the coffin of prompt engineers while making some of the hacks we have discussed these past weeks more valuable than ever.

But before we discuss how to interact with these models, we need to understand when to use them. Interestingly, o1 models perform worse than GPT-4o in some tasks, which leads us to the question:

When should I use each model? Luckily, the heuristic is pretty straightforward.

Subscribe to Full Premium package to read the rest.
