
An Alignment Revolution & Our Future According to Experts

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.


  • AI Research of the Week: DPO, An Alignment Revolution

  • Leaders: What the World’s Best Think About Our Future

🤩 AI Research of the week 🤩

“It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation.”

If this is how one of the most prominent researchers in the world, Andrew Ng, refers to a recent research paper, you know it’s awesome.

A group of researchers from Stanford and CZ Biohub have presented DPO, a new alignment breakthrough that could give the open-source community back the capacity to challenge the big tech companies, something thought impossible… until now.

The Rich Game

When one looks at the numbers, it’s easy to realize that building the best Large Language Models (LLMs) like ChatGPT is a rich people’s game.

Building ChatGPT

The current gold standard is as follows:

Source: Chip Huyen

You first assemble billions of documents with trillions of words and, in a self-supervised manner, ask the model to predict the next token (word or sub-word) in a given sequence.

Next, you want to teach the model to behave in a certain way, like being able to answer questions when prompted.

That takes us to step 2, where the model performs the same next-token exercise on a curated, dialogue-based dataset, becoming what we call an assistant.

But we not only want the model to behave in a certain way, we also want it to maximize the quality and safety of its responses.

Reinforcement Learning from Human Feedback, or RLHF, steps 3 and 4, does exactly that by teaching the model to not only answer as expected but to give the best and safest possible answer aligned with human preferences.

However, there’s a problem.

Money goes brrrr

Put simply, RLHF is very, very expensive.

It involves three steps:

  1. Building a preference dataset. By prompting the model from step 2, you build a dataset of {prompt, response 1, response 2} triples, where a group of humans decides which response is better, 1 or 2.

  2. Training a reward model. In step 3 of the previous diagram, you build a model that learns to predict, for every new response, how good it is. As you may imagine, we use the preference dataset to train this model, teaching it to assign a higher score to the preferred response in every pair.

  3. Maximize reward. Finally, in step 4 of the diagram, we train the model to maximize the reward. In other words, we train our model on a policy that learns to obtain the highest rewards from the reward model. As this model represents human preferences, we are implicitly aligning our model to those human preferences.
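To make step 2 concrete, here is a minimal sketch of the kind of pairwise loss a reward model can be trained with. It assumes a logistic loss on the score difference; the function names and the scores are hypothetical and only illustrate the idea of pushing the preferred response's score above the rejected one.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-probability that the preferred response outranks the rejected one."""
    return -math.log(sigmoid(score_preferred - score_rejected))


# Hypothetical reward-model scores for one {prompt, response 1, response 2} triple.
print(pairwise_reward_loss(2.0, 0.5))  # model agrees with the human label: small loss
print(pairwise_reward_loss(0.5, 2.0))  # model disagrees with the human label: larger loss
```

Minimizing this loss over the whole preference dataset is what nudges the reward model to score responses the way the human labelers did.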

In layman’s terms, you are training a model to find the ideal policy that learns to take the best action, in this case word prediction, for every given text.

Think of a policy as a decision-making framework: the model learns to take the best action (word prediction) for a given state (prompt).

And that gives us ChatGPT. And Claude. And Gemini. But with RLHF, costs get real.

Even though the first two steps before RLHF already consume millions of US dollars, RLHF is just prohibitive for the majority of the world’s researchers, because:

  • It involves building a highly curated dataset that requires extensive model prompting and also the collaboration of expensive human experts.

  • It requires training an entirely new model, the reward model. This model is often as big and as good as the model we are aligning, doubling the compute costs.

  • And it involves running the soon-to-be-aligned model and reward model in an iterative Reinforcement Learning training cycle.

Put simply, unless you go by the name of Microsoft, Anthropic, Google, and a few others, RLHF is way out of your league.

But DPO changes this.

Keep it Simple, Stupid

Direct Preference Optimization (DPO) is a mathematical breakthrough where the trained model is aligned to human preferences without requiring a Reinforcement Learning loop.

In other words, you optimize against an implicit reward without explicitly materializing that reward. Without materializing a reward model.

But before answering what the hell that means, we need to review how models learn.

It’s just trial and error!

Basically, all neural networks, be that ChatGPT or Stable Diffusion, are trained using backpropagation.

In succinct terms, it’s no more than glorified trial and error.

You define a computable loss that tells the model how wrong its prediction is, and you apply undergraduate-level calculus to optimize the parameters of the model to slowly minimize that loss using partial derivatives.
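As a toy illustration of that loop, the sketch below runs gradient descent on a single parameter. The quadratic loss, the learning rate, and the step count are made up for the example; a real network has billions of parameters and uses backpropagation to get the gradients, but the update rule is the same idea.

```python
def loss(w: float) -> float:
    """Toy loss: how far the parameter is from the (hypothetical) ideal value 3.0."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    """Derivative of the toy loss with respect to the parameter."""
    return 2.0 * (w - 3.0)


def gradient_descent(w: float, lr: float = 0.1, steps: int = 50) -> float:
    """Repeatedly nudge the parameter against the gradient to shrink the loss."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))  # the parameter ends up very close to 3.0
```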

If we think about ChatGPT and its next-token prediction task, during training we know what the next word should be, so the loss is computed from the probability the model gives to the correct word out of all the words in its vocabulary: the lower that probability, the higher the loss.

For instance, if we know the next word should be “Cafe” and the model has only assigned a 10% probability to that word, the loss is quite big.

Source: @akshar_pachaar (X.com)

Consequently, the model slowly learns to assign the highest probability possible to the correct word, thereby learning to efficiently model language.
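For the curious, the usual way to turn that probability into a loss is cross-entropy, which for a single token is just the negative logarithm of the probability given to the correct word. A minimal sketch, with the function name chosen for this example:

```python
import math


def next_token_loss(prob_of_correct_token: float) -> float:
    """Cross-entropy for one token: negative log of the probability of the correct word."""
    return -math.log(prob_of_correct_token)


print(next_token_loss(0.10))  # about 2.30: the model barely believed in "Cafe"
print(next_token_loss(0.90))  # about 0.11: the model was confident and right
```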

Therefore, looking at the earlier four-step diagram, in steps 1 and 2 the model learns exactly as we just explained, while in RLHF the loss function essentially teaches the model to maximize the reward.

Specifically, the loss function in RLHF looks something like this:
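The original image is not reproduced here. As a sketch, assuming the standard KL-regularized form used in the DPO paper, the objective (strictly speaking something to maximize rather than a loss) can be written as below, where the policy being trained π_θ, the original reference model π_ref, the strength β, and the prompt dataset D are notation introduced for this reconstruction:

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)} \big[ r(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```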

where the first term r(x,y) computes the reward given by the reward model.

And what’s the term subtracted from the reward?

The RLHF loss function also includes a regularization term to keep the model from drifting too far from the original model.
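To see the tug-of-war between reward and regularization, here is a small sketch for a single sampled response. It assumes the drift is estimated by the log-probability gap between the trained model and the original one, a common per-sample estimate of the KL term; `beta` and all numbers are illustrative.

```python
def penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward minus a penalty for assigning the response a different log-probability
    than the original (reference) model did."""
    return reward - beta * (logp_policy - logp_ref)


# Same reward from the reward model, two different amounts of drift.
print(penalized_reward(1.0, logp_policy=-5.0, logp_ref=-5.0))  # no drift: reward stays 1.0
print(penalized_reward(1.0, logp_policy=-2.0, logp_ref=-8.0))  # drifted policy: about 0.4
```

The larger `beta` is, the more the model is held close to its starting point.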

So, what makes DPO different from RLHF?

Algebra comes to our aid

The key intuition is that, unlike RLHF, DPO does not need a new model (the reward model) to carry out the alignment process.

In other words, the Language Model you are training is secretly its own reward model.

Source: Stanford

Using clever algebra and based on the Bradley-Terry preference model—a probability framework that essentially predicts the outcomes of pair-wise comparisons—they implicitly define the reward and train the LLM directly without requiring an explicit reward model.
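For reference, the Bradley-Terry model turns two reward scores into the probability that one response beats the other. In the notation assumed here, y_1 and y_2 are the two responses to prompt x and σ is the logistic function:

```latex
p(y_1 \succ y_2 \mid x)
= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
= \sigma\big(r(x, y_1) - r(x, y_2)\big)
```

Note that only the difference between the two rewards matters, which is what the algebra that follows exploits.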

Although the DPO paper gives the complete mathematical procedure, the key intuition is that the process goes from:

  • Training an LLM » define a preference dataset » training a reward model » training the LLM to find the optimal policy that maximizes the reward, to:

  • Training an LLM » define a preference dataset » training the LLM to find the optimal policy.

But how do we compute the reward in DPO?

Fascinatingly, the reward is implicitly defined as part of the optimal policy.

In other words, what DPO proves is that, when working with a human preference dataset, we don’t need to first create a reward model that predicts what a human would choose and then use that model to optimize our goal model.

In fact, we can find the optimal policy that aligns our model without calculating the actual reward, because the optimal policy is a function of the reward: by finding that policy, we are implicitly maximizing the reward.

Bottom line, you can think of DPO as a cool algebra trick that skips calculating the reward explicitly by directly finding the policy that implicitly maximizes the reward.
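A compressed sketch of that trick, following the derivation in the DPO paper. The symbols π_r (the optimal policy for reward r), π_ref (the original model), β (the regularization strength), and Z(x) (a normalizing term that depends only on the prompt) are notation assumed for this sketch:

```latex
\begin{align}
\pi_r(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big) \\
r(x, y) &= \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
r(x, y_w) - r(x, y_l) &= \beta \log \frac{\pi_r(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_r(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{align}
```

Because a pairwise preference only depends on the difference between two rewards, the awkward Z(x) term cancels, and what remains is expressed purely in terms of the policy and the original model.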

This gives us the following loss function:
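The original image is not reproduced here. As stated in the DPO paper, with the trained policy π_θ, the original model π_ref, the logistic function σ, the strength β, and the preference dataset D as assumed notation, the loss is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma \left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```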

Here, y_w and y_l stand for the winning and losing responses in a given comparison.

The intuition is that the higher the probability the policy gives to the preferred response, and the lower the assigned probability to the losing response, the smaller the loss.
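Here is a minimal sketch of that loss for a single preference pair, assuming we already have the log-probability each model assigns to each full response. The function name, argument names, `beta`, and the numbers are all illustrative, not taken from the paper's code.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (winning, losing) pair.

    logp_* come from the model being trained, ref_logp_* from the frozen original model.
    """
    # How much more the policy favors the winner over the loser than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # log(1 + exp(-margin)) is the same as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss starts at log(2), about 0.693.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy raises the winner and lowers the loser: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Notice that there is no reward model anywhere in the function: two forward passes per response (policy and reference) are enough to compute the loss.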

2024, the Year of Efficiency

We aren’t even halfway through January and we have already seen the disruption of one of the most painful and expensive, yet essential steps in the creation of our best models.

Unequivocally, DPO levels the playing field, allowing universities and small-time research labs to align their models at orders-of-magnitude lower cost.

Open-source can finally compete. So buckle up, Sam.

🫡 Key contributions 🫡

  • DPO promises an elegant, cost-efficient way of aligning models to human preferences, eliminating the moat private companies had in the field of alignment.

  • Open-source communities will now have the tools to develop well-aligned models, making open-source models more appealing than ever to companies.


🥇 Leaders 🥇

The Opinions of the World’s Best on AI

The largest survey of AI experts to date has just been conducted.

It gives an amazing view of what some of the most prominent researchers in the world, most of them with multiple papers published in prestigious AI venues like NeurIPS, think about the future of AI.

We will cover what jobs are most likely to disappear first, when AI will surpass human prowess in all tasks, and the experts’ opinions on some of the most dangerous, extinction-level risks.

The scale and expertise of the respondents make this survey a crucial reference point for understanding current trends and future directions in AI research.
