OpenAI has Shown Us the Next Design Paradigm for ChatGPT

Ignacio de Gregorio Noblejas
June 25, 2023

🏝 TheTechOasis 🏝

Although Generative AI is all the buzz and widely thought of as new thanks to the release of ChatGPT in November 2022, it’s actually not that new.

In fact, Large Language Models (LLMs) have been around since June 2018, when OpenAI released GPT-1.

However, it took OpenAI more than four years to go from there to ChatGPT, the first GenAI product to reach a mass audience.

But why?

Simple: another innovation was required to bring GenAI to the masses, and OpenAI was the first to perfect it.

Now they have released the evolution of that breakthrough, and the results show it can turn ChatGPT, the world’s most advanced AI chatbot but one that sucks big time at maths, into a world-class mathematician.

The Half-True Definition of LLMs

When people refer to ChatGPT, they describe it as an AI model that predicts the next word in a sequence.

And I’m one of those guys when I want to keep the definition short.

However, that statement needs nuance: if we want to be purists, it’s the definition of a base LLM.

ChatGPT is much more than that, because a model trained only to predict the next word in a sequence is basically useless as an assistant.

For instance, let’s say you ask it “What’s the capital of France?”

ChatGPT automatically responds “Paris”, but a base model could respond with “What’s the capital of France riddle”.

Wait, what? How does that make sense?

It’s actually pretty darn simple:

A base model has been trained on one simple task: predicting the next word.

That’s a simple probabilistic calculation; the model is simply giving you back the most probable continuation of the text it was given.

And as these models are trained on a huge part of the Internet’s text, the continuation “What’s the capital of France riddle” may well be just as probable as “Paris”, because the Internet is far from being a perfectly written novel.
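
To make that concrete, here is a minimal sketch of how a base model picks its next word. Everything in it is an assumption for illustration: the three candidate words and their scores are invented, and a real LLM scores tens of thousands of tokens with a neural network instead of a hand-written table.

```python
import math

# Invented scores (logits) for what might follow the prompt
# "What's the capital of France". These numbers are NOT from a real model.
logits = {
    "riddle": 2.1,
    "Paris": 2.0,
    "quiz": 0.5,
}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def most_likely(scores):
    """Greedy decoding: return the single most probable next word."""
    probs = softmax(scores)
    return max(probs, key=probs.get)

probs = softmax(logits)
best = most_likely(logits)

for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:>7}: {p:.2f}")
print("base model continues with:", best)
```

With these made-up scores, “riddle” and “Paris” end up almost equally likely, and nothing in the objective tells the model that one of them is the helpful answer.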

And what’s even more worrying, base models have no filter whatsoever, and they easily become racist, homophobic, hate-speech machines, making the experience of interacting with them a very unpleasant one.

So what did OpenAI do to turn this unusable monster into the ChatGPT the world marvels at?

Humans are the Answer

Leveraging work by Paul Christiano et al. on learning from human preferences, they decided to train the model to produce the responses humans prefer.

The process was simple.

They would train another model to score the outputs given by the base model.

To do so, they assembled a team of humans and sampled several responses from the base model (GPT-3.5 when ChatGPT was first released) for each question or instruction.

Then, those human engineers would score the responses, so that it was clear which answer was the most useful, or the least racist, for instance.

With that, they trained a reward model that learned to predict, for a set of responses to a particular instruction, which one would have received the highest score from the human engineers.

That way, this model became an artificial representation of these experts.
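
As a rough sketch of what training such a reward model can look like, the snippet below fits a tiny linear scorer on pairs of (preferred, rejected) responses using a pairwise logistic loss. This is one common formulation and an assumption on my side, not something OpenAI’s exact recipe is confirmed to match here. The two features (helpfulness, toxicity) and all the numbers are invented; a real reward model is a neural network reading the full text.

```python
import math

def reward(weights, features):
    """Linear reward: a stand-in for a neural reward model."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))

def train(pairs, dims, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * dims
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dims):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical human judgments: each pair is (preferred, rejected),
# with invented features [helpfulness, toxicity].
pairs = [
    ([0.9, 0.0], [0.2, 0.8]),
    ([0.7, 0.1], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.1]),
]

weights = train(pairs, dims=2)
for preferred, rejected in pairs:
    print(reward(weights, preferred) > reward(weights, rejected))
```

After training, the scorer ranks every preferred response above its rejected counterpart, which is the sense in which it imitates the human labelers.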

Next, they used this reward model’s scores as the training signal to fine-tune the base model, a process known as Reinforcement Learning from Human Feedback, or RLHF.

And that, my friends, created ChatGPT.

But now, OpenAI has presented us with what they think is the next evolution of these reward models.

And the results are amazing.

The Outcome Problem

The reward models I explained earlier are known as outcome-supervised reward models (ORMs).

In layman’s terms, they are models trained to judge only the final result the LLM produces.

In other words, they don’t pay attention to the whole response process, just to the final answer.

However, when dealing with complex tasks like solving a math equation, there’s a huge chance that the LLM messes up one or more steps along the way, completely trashing the final response.

Or, even worse, the LLM might somehow land on the correct final answer, so the reward model accepts it as ok, even though the intermediate steps are wrong (sounds weird, but it actually happens in the world of LLMs).

Long story short, although ChatGPT has still managed to become the best AI chatbot in the world, it completely sucks at complex tasks like mathematics.

Therefore, what OpenAI is proposing is, “Why not train a model to pay attention not only to the final result but also to the process in between?”

And that leads us to Process-supervised Reward Models (PRMs).

Don’t check, verify!

According to OpenAI, PRMs are extremely promising for LLMs, as they make it possible to train base models not only to generate responses that are useful to humans, but also to work through complex tasks the way humans do.

As you can see in the image below, the human engineers not only reward or punish good or bad final responses, but also reward or punish each intermediate step, which pushes the base model to mimic not only human-like answers but also human-like reasoning processes.
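
The difference between the two kinds of reward model can be sketched in a few lines. The example is a toy under stated assumptions: the two solutions and the per-step correctness probabilities are invented stand-ins for what a trained PRM might output, and multiplying the step probabilities into one solution score is just one simple way to aggregate them.

```python
from math import prod

# Two hypothetical solutions to "2 * (3 + 4) = ?". Each step carries an
# invented probability of being correct, as a trained PRM might estimate.
solutions = [
    {
        "name": "lucky",  # two wrong steps that happen to cancel out
        "steps": [("3 + 4 = 8", 0.05), ("2 * 8 = 14", 0.10)],
        "answer": 14,
    },
    {
        "name": "sound",  # every step is valid
        "steps": [("3 + 4 = 7", 0.98), ("2 * 7 = 14", 0.97)],
        "answer": 14,
    },
]

def outcome_score(solution, correct_answer):
    """ORM-style: look only at the final answer."""
    return 1.0 if solution["answer"] == correct_answer else 0.0

def process_score(solution):
    """PRM-style: probability that every step is correct (product of steps)."""
    return prod(p for _, p in solution["steps"])

for s in solutions:
    print(s["name"], outcome_score(s, 14), round(process_score(s), 4))

best_by_process = max(solutions, key=process_score)
print("process supervision prefers:", best_by_process["name"])
```

An outcome-only check cannot tell the two solutions apart, since both end in 14, while the step-level score clearly favors the one whose reasoning holds up.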

And the results?

While our current ChatGPT sucks at solving complex math equations, GPT-4 paired with a PRM solves almost 80% of the problems it was tested on from the challenging MATH dataset, and it’s capable of doing stuff like the image below:

I mean, while a few months ago ChatGPT was not capable of adding four consecutive numbers, a GPT-4 base model fine-tuned with a PRM is literally quoting mathematical theorems like the Sophie Germain identity.

That’s a huge improvement, achieved simply by understanding the importance of RLHF to the AI chatbot pipeline and optimizing it to new levels.