
# An Alignment Revolution & Our Future According to Experts

**TheTechOasis**

**Breaking down the most advanced AI systems in the world to prepare you for your future.**

**5-minute weekly reads.**

## TLDR:

- **AI Research of the Week:** DPO, An Alignment Revolution
- **Leaders:** What the World's Best Think About Our Future

# AI Research of the week

"*It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation.*"

If this is how one of the most prominent researchers in the world, *Andrew Ng*, refers to a recent research paper, you know it's awesome.

A group of researchers from Stanford and CZ Biohub have presented *DPO*, a new alignment breakthrough that could give the open-source community back the capacity to challenge the big tech companies, something thought impossible... until now.

**The Rich Game**

When one looks at the numbers, it's easy to realize that building the best Large Language Models (LLMs) like ChatGPT is a rich people's game.

**Building ChatGPT**

The current gold standard is as follows:

Source: Chip Huyen

You first assemble billions of documents with trillions of words and, in a self-supervised manner, you ask the model to predict the next token (word or sub-word) of a given sequence.

Next, you want to teach the model to behave in a certain way, like being able to answer questions when prompted.

That takes us to step 2, where we assemble a dataset on which the model performs the same exercise, but on curated, dialogue-based data, becoming what we call an *assistant*.

But we not only want the model to behave in a certain way, we also want it to maximize the quality and safety of its responses.

**R**einforcement **L**earning from **H**uman **F**eedback, or **RLHF**, steps 3 and 4, does exactly that by teaching the model to not only answer as expected but to give the best and safest possible answer aligned with human preferences.

However, there's a problem.

**Money goes brrrr**

Put simply, RLHF is very, very expensive.

It involves three steps:

- **Building a preference dataset.** By prompting the model from step 2, they build a dataset of *{prompt, response 1, response 2}*, where a group of humans decides which response is better, 1 or 2.
- **Training a reward model.** In step 3 of the previous diagram, you build a model that learns to predict, for every new response, how good it is. As you may imagine, we use the preference dataset to train this model, by training it to assign a higher score to the preferred response out of every pair.
- **Maximizing the reward.** Finally, in step 4 of the diagram, we train the model to maximize the reward. In other words, we train our model on a policy that learns to obtain the highest rewards from the reward model. As this reward model represents human preferences, we are implicitly aligning our model to those human preferences.
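To make the reward-model step concrete, here is a minimal sketch of a pairwise loss for a single human comparison. It assumes the common logistic form, `-log(sigmoid(score_preferred - score_rejected))`; the article does not spell out the exact loss, and the function names and scores below are invented for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Loss for one comparison: small when the preferred response scores higher."""
    return -math.log(sigmoid(score_preferred - score_rejected))

# Hypothetical reward-model scores for the two responses to one prompt.
loss_when_right = pairwise_reward_loss(score_preferred=2.0, score_rejected=-1.0)
loss_when_wrong = pairwise_reward_loss(score_preferred=-1.0, score_rejected=2.0)

print(f"ranking agrees with the humans:    {loss_when_right:.3f}")
print(f"ranking disagrees with the humans: {loss_when_wrong:.3f}")
```

Only the difference between the two scores matters here, so the reward model is pushed to rank the pair correctly rather than to hit any particular absolute score.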

In layman's terms, **you are training a model to find the ideal policy that learns to take the best action, in this case word prediction, for every given text.**

*Think of a policy as a decision-making framework: the model learns to take the best action (word prediction) for a given state (prompt).*

And that gives us *ChatGPT*. And *Claude*. And *Gemini*. But with RLHF, costs get real.

Even though the first two steps before RLHF already consume millions of US dollars, RLHF is just prohibitive for the majority of the world's researchers, because:

It involves building a highly curated dataset that requires extensive model prompting and also the collaboration of expensive human experts.

It requires training an entirely new model, the reward model. This model is often as big and as good as the model we are aligning, doubling the compute costs.

And it involves running the soon-to-be-aligned model and reward model in an iterative Reinforcement Learning training cycle.

Put simply, unless you go by the name of *Microsoft*, *Anthropic*, *Google*, and a few others, RLHF is way out of your league.

But *DPO* changes this.

**Keep it Simple, Stupid**

*Direct Preference Optimization (DPO)* is a mathematical breakthrough where the trained model is aligned to human preferences without requiring a Reinforcement Learning loop.

In other words, you optimize against an implicit reward without explicitly materializing that reward. Without materializing a reward model.

But before answering what the hell that means, we need to review how models learn.

**It's just trial and error!**

Basically, all neural networks, be that *ChatGPT* or *Stable Diffusion*, are trained using *backpropagation*.

In succinct terms, it's no more than glorified trial and error.

You define a computable loss that tells the model how wrong its prediction is, and you apply undergraduate-level calculus to optimize the parameters of the model to slowly minimize that loss using partial derivatives.
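As a toy illustration of that loop, here is a minimal sketch of gradient descent on a "model" with a single parameter. The model, the numbers, and the learning rate are all made up for the example; real networks do the same thing across billions of parameters, with backpropagation computing the derivatives automatically.

```python
def loss(w: float, x: float, target: float) -> float:
    """Squared error of a one-parameter model that predicts w * x."""
    return (w * x - target) ** 2

def loss_gradient(w: float, x: float, target: float) -> float:
    """Derivative of the loss with respect to w, obtained with the chain rule."""
    return 2.0 * (w * x - target) * x

w = 0.0               # arbitrary starting guess
x, target = 2.0, 6.0  # toy data point: the ideal parameter is w = 3
learning_rate = 0.05

history = []
for step in range(50):
    history.append(loss(w, x, target))
    w -= learning_rate * loss_gradient(w, x, target)  # step against the gradient

print(f"learned w = {w:.4f}, final loss = {loss(w, x, target):.6f}")
```

Each pass is one round of "trial" (compute the loss) and "error correction" (nudge the parameter in the direction that lowers it).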

If we think about ChatGPT and its next-token prediction task, during training we know what the next word should be, so the loss is computed from the probability the model gives to the correct word out of all the words in its vocabulary: the lower that probability, the higher the loss.

For instance, if we know the next word should be "Cafe" and the model has only assigned a 10% probability to that word, the loss is quite big.

Source: @akshar_pachaar (X.com)

Consequently, the model slowly learns to assign the highest probability possible to the correct word, thereby learning to efficiently model language.
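To put a number on the "Cafe" example, here is a minimal sketch that assumes the loss is the usual cross-entropy, that is, the negative natural logarithm of the probability assigned to the correct token. The article does not name the exact formula, and `next_token_loss` is a name invented for illustration.

```python
import math

def next_token_loss(prob_of_correct_token: float) -> float:
    """Cross-entropy for one prediction: -ln(probability of the correct token)."""
    return -math.log(prob_of_correct_token)

loss_unsure = next_token_loss(0.10)     # the model gave "Cafe" only 10%
loss_confident = next_token_loss(0.90)  # the model gave "Cafe" 90%

print(f"p = 0.10 -> loss = {loss_unsure:.3f}")
print(f"p = 0.90 -> loss = {loss_confident:.3f}")
```

The loss only reaches zero when the model puts 100% of the probability on the correct token, which is why training keeps pushing that probability up.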

Therefore, looking at the earlier four-step diagram, in steps 1 and 2 the model learns exactly as we just explained, while in RLHF the loss function essentially teaches the model to maximize the reward.

Specifically, the loss function in RLHF looks something like this:

where the first term r(x,y) computes the reward given by the reward model.

**And what's the term subtracted from the reward?**

*The RLHF loss function also includes a regularization term to avoid the model drifting too much from the original model.*
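A sketch of that objective in the notation of the DPO paper, written as a quantity to maximize rather than a loss to minimize. The symbols $\pi_\theta$ (the model being trained), $\pi_{\mathrm{ref}}$ (the original model), $\beta$ (how strongly drifting is penalized) and $\mathcal{D}$ (the dataset of prompts) come from the paper, not from this newsletter:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$

The first term is the reward, and the KL divergence is the regularization term: it grows as the trained model's answers move away from those of the original model.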

So, *what makes DPO different from RLHF?*

**Algebra comes to our aid**

The key intuition is that, unlike RLHF, DPO does not need a new model, the reward model, to carry out the alignment process.

In other words, the Language Model you are training is secretly its own reward model.

Source: Stanford

Using clever algebra and based on the *Bradley-Terry* preference model (a probability framework that essentially predicts the outcomes of pair-wise comparisons), they implicitly define the reward and train the LLM directly without requiring an explicit reward model.

Although the DPO paper gives the complete mathematical procedure, the key intuition is that the process goes from:

Training an LLM **»** define a preference dataset **»** training a reward model **»** training the LLM to find the optimal policy that maximizes the reward,

to:

Training an LLM **»** define a preference dataset **»** training the LLM to find the optimal policy.

*But how do we compute the reward in DPO?*

Fascinatingly, the reward is implicitly defined as part of the optimal policy.

In other words, what DPO proves is that, when working with a human preference dataset, we don't need to first create a reward model that predicts what a human would choose and then use that model to optimize our goal model.

In fact, we can find the optimal policy that aligns our model without calculating the actual reward, because the optimal policy is a function of the reward: by finding that policy, we are implicitly maximizing the reward.

Bottom line, you can think of DPO as a cool algebra trick that skips calculating the reward explicitly by directly finding the policy that implicitly maximizes the reward.
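For the curious, the algebra trick can be sketched in three lines, following the DPO paper. The notation is the paper's, not this newsletter's: $\pi_{\mathrm{ref}}$ is the original model, $\beta$ controls how far the model may drift from it, $Z(x)$ is a normalizing constant that depends only on the prompt, and $\sigma$ is the logistic function. First, the policy that maximizes "reward minus drift penalty" has a closed form:

$$
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
$$

Solving for the reward expresses it in terms of the policy itself:

$$
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
$$

Finally, the Bradley-Terry model only looks at the difference between two rewards, so the awkward $Z(x)$ term cancels out:

$$
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \sigma\!\left(\beta \log \frac{\pi_r(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_r(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$

Everything on the right-hand side can be computed from the language model and its frozen copy, which is why no separate reward model is needed.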

This gives us the following loss function:
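In the notation of the DPO paper (again the paper's symbols, not this newsletter's: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ a frozen copy of the model before alignment, $\beta$ a hyperparameter, $\sigma$ the logistic function, and $y_w$, $y_l$ the responses written as yw and yl in the text), the loss reads:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$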

Where *yw* and *yl* stand for the winning and losing responses in a given comparison.

*The intuition is that the higher the probability the policy gives to the preferred response, and the lower the probability it assigns to the losing response, the smaller the loss.*
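To see that intuition in numbers, here is a minimal sketch of the DPO loss for a single comparison, written in plain Python. It assumes the form of the loss given in the DPO paper, where each response's log-probability is measured relative to a frozen reference copy of the model; the log-probabilities, the `beta` value, and the function name are invented for illustration, and a real implementation would work on batches of token-level log-probabilities.

```python
import math

def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (winner, loser) pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    gain_w = policy_logp_w - ref_logp_w
    gain_l = policy_logp_l - ref_logp_l
    margin = beta * (gain_w - gain_l)
    # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raised the winner's log-probability and lowered the loser's.
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy moved the wrong way.
loss_worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)

print(f"start:  {loss_start:.4f}")
print(f"better: {loss_better:.4f}")
print(f"worse:  {loss_worse:.4f}")
```

Notice that no reward model appears anywhere: the only inputs are log-probabilities from the model being trained and from its frozen copy.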

**2024, the Year of Efficiency**

We aren't even halfway through January and we have already seen the disruption of one of the most painful and expensive, yet essential steps in the creation of our best models.

Unequivocally, DPO levels the playing field, allowing universities and small-time research labs to build models that can be aligned with orders of magnitude lower costs.

Open-source can finally compete. So buckle up, Sam.

### Key contributions

- **DPO** promises an elegant way of aligning models to human preferences in a cost-efficient way, eliminating the moat private companies had in the field of alignment.
- Open-source communities will now have the tools to develop well-aligned models, making open-source models more appealing than ever to companies.

### Best news of the week

### Leaders

**The Opinions of the World's Best on AI**

The world's largest survey of AI experts has been conducted.

It gives an amazing view of what some of the most prominent researchers in the world, most of them with multiple papers published in some of the most prestigious AI venues like NeurIPS, think about the future of AI.

We will cover what jobs are most likely to disappear first, when AI will surpass human prowess in all tasks, and the experts' opinions on some of the most dangerous, extinction-level risks.

The scale and expertise of the respondents make this survey a crucial reference point for understanding current trends and future directions in AI research.
