Meta's Self-Rewarding Models & The AI Global Tax

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.


  • AI Research of the Week: Meta’s Bet on SuperHuman Models

  • Leaders: The Threat of a Global AI Tax

🤩 AI Research of the week 🤩

Meta, the company behind Facebook, WhatsApp, and the Ray-Ban Meta glasses, has announced a highly promising AI breakthrough: Self-Rewarding Language Models.

With this method, a fine-tuned LLaMa-2 70B model surpasses models like Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard, despite reportedly being at least an order of magnitude smaller.

However, that is not the true breakthrough, as these new models also show signs of being a reasonable path to achieving superhuman abilities by eliminating the human from the equation.

But what does that mean? And is that even a good thing?

Let’s find out.

The Rise of a New Alignment Method

To this day, humans play a crucial role in creating every frontier model, such as ChatGPT or Claude.

As explained in my newsletter from two weeks ago, the later stages of the training process of our best language models include human preference training.

In a nutshell, we make our models achieve higher utility and reduce the risk of harmful responses by teaching them to respond how a human expert would have.

The previous link goes into much more detail, but the gist is that we have to build a human preferences dataset: for each prompt, it holds two responses, and a human expert has decided which one is better.

Source: Anthropic

This is very expensive, as it requires extensive expert human labor.
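To make the idea concrete, here is a minimal sketch of what one record in such a dataset could look like. The class name, field names, and example texts are my own illustrative assumptions, not the format of any particular lab's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One record of a human preference dataset (field names are illustrative)."""
    prompt: str
    chosen: str    # the response the human annotator preferred
    rejected: str  # the response the annotator ranked lower


def label_pair(prompt: str, response_a: str, response_b: str,
               human_prefers_a: bool) -> PreferencePair:
    """Turn a human judgement on two candidate responses into a training record."""
    if human_prefers_a:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# Hypothetical example: the annotator prefers the first response.
pair = label_pair(
    prompt="How do I reset my router?",
    response_a="Hold the reset button for about ten seconds, then wait for it to reboot.",
    response_b="Routers cannot be reset.",
    human_prefers_a=True,
)
```

Every record needs a human decision, which is exactly where the cost comes from.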

At this point, you have to take two possible directions:

Source: DPO research paper (Rafailov et al)

  • Tangible Rewarding through Reinforcement Learning from Human Feedback (RLHF): you build a reinforcement learning pipeline that requires a separate reward model (often of similar size and quality to the model being trained) and run a policy optimization process in which the trained model learns to maximize the reward. In other words, you measure the reward model’s score on your model’s responses and train the model to achieve higher scores.

  • Intrinsic Rewarding through Direct Preference Optimization (DPO): instead of materializing the reward, which would require a reward model, the model is trained directly on the preference pairs. An algebraic trick expresses the reward in terms of the policy itself, so the model implicitly maximizes the reward by directly learning the policy that maximizes it.
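The "algebra trick" behind DPO can be sketched in a few lines. This is a minimal, single-example version of the loss from the DPO paper, written in plain Python rather than a deep learning framework. The function name, the argument names, and the default `beta=0.1` are assumptions for illustration; each input is the summed log-probability of a full response under either the model being trained (the policy) or a frozen reference copy.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the policy favors the chosen response more than the reference does.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy identical to reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy now prefers the chosen response
```

Notice that no reward model appears anywhere: the "reward" is just a difference of log-probabilities, which is why DPO is so much cheaper to set up than RLHF.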

Although very new, DPO is already being cited as a major breakthrough, as it shows equal or even better results than RLHF, while being dramatically cheaper and easier to build.

But Meta has taken DPO and gone a step further with the question… do we actually need humans?

The SuperAlignment Problem

Despite their undeniable credentials, both RLHF and DPO are still bottlenecked by us, humans.

The reason is that they require the human preferences dataset, which means that these models can only aspire to be as good as the humans building the dataset.

In other words, they are constrained by the limitations of our species. So how can we align the superhuman models of the future using only human signals?

Recently, OpenAI theorized that humans may still be able to align superior, superhuman models: its weak-to-strong generalization paradigm suggests that we can teach a model how to behave without making it dumber by ‘forcing’ it into our limitations.

However, researchers concluded that this was definitely not enough, and that we need “something else” to guarantee our superhuman models of the future don’t spiral out of control.

And that thing could be Meta’s Self-Rewarding Models, Meta’s way of saying “Humans, get out of the way”.

The Self-Improving Paradigm

In this paper, Meta suggests a new method where the models are trained using the DPO method we just talked about, while allowing the model to generate its own, non-human, rewards.

In other words, if proven at scale, it’s the best of both worlds and, unequivocally, a complete revolution.

Let’s take a look.

Source: Meta

Circling back to the two previously covered methods, RLHF is very expensive and requires a reward model.

DPO does not, but it also requires humans to define the boundaries and how the model should behave.

Also, in both cases, the rewards don’t get better over time; they are fixed.

Instead, Meta’s new iterative framework defines a training pipeline where the current model (Mt) first generates a set of responses and scores them itself, autonomously building the preference dataset. This dataset is then used with DPO (meaning there’s no need for a reward model) to obtain the new, aligned model, Mt+1.

Then, they take Mt+1 and repeat the process to get a new model, Mt+2, and so on.
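The loop described above can be sketched as follows. This is only a skeleton under stated assumptions, not Meta's implementation: the `generate` and `judge` methods, the `dpo_train` callable, and the choice of pairing the highest-scored response against the lowest-scored one are all hypothetical placeholders for whatever the real pipeline does.

```python
from typing import Callable, List, Protocol, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


class SelfRewardingModel(Protocol):
    """Hypothetical interface: the same model both answers and scores."""

    def generate(self, prompt: str, n: int) -> List[str]: ...

    def judge(self, prompt: str, response: str) -> float: ...


def build_preference_pairs(model: SelfRewardingModel, prompts: Sequence[str],
                           n_candidates: int = 4) -> List[Pair]:
    """Let the model write several responses per prompt and rank them itself."""
    pairs: List[Pair] = []
    for prompt in prompts:
        candidates = model.generate(prompt, n_candidates)
        if not candidates:
            continue
        scored = [(model.judge(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        worst_score, worst = min(scored, key=lambda sc: sc[0])
        if best_score > worst_score:  # a tie carries no preference signal
            pairs.append((prompt, best, worst))
    return pairs


def self_rewarding_loop(model: SelfRewardingModel, prompts: Sequence[str],
                        dpo_train: Callable[[SelfRewardingModel, List[Pair]], SelfRewardingModel],
                        iterations: int = 3) -> SelfRewardingModel:
    """Mt -> Mt+1 -> Mt+2 ...: each round trains on pairs the current model built."""
    for _ in range(iterations):
        pairs = build_preference_pairs(model, prompts)
        model = dpo_train(model, pairs)  # hypothetical DPO training step
    return model
```

The key design point is that `model` plays both roles inside the loop, so whatever DPO improves in one round also changes the judge used in the next.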

In other words, the model is not only becoming better aligned with each iteration (the objective all along) but it’s also learning to get better at scoring responses, which in turn explains why the model in the newer iteration is better.

In just three iterations, the fine-tuned LLaMa model was already on par with the best of the bunch, and the self-improving mechanism showed no signs of saturation.

It’s, quite literally, the first time we have seen a self-improving LLM that doesn’t require humans to dictate what’s “good”.

This is huge, as we have already seen what self-improving methods achieve, with examples like the superhuman AlphaGo completely obliterating everyone in the game of Go by playing against itself.

If we now have the power to train LLMs in self-improving pipelines, we could build superhuman language models, whatever that turns out to be.

But as with everything, there’s a trade-off.

Alienating Humans

It’s no secret that to build superhuman models, humans are the bottleneck.

However, eliminating humans from the training process and relinquishing any ‘say’ on the matter is something that needs to be carefully evaluated.

We are already extremely bad at explaining how LLMs think… but at least we have control over them.

Now, while the former issue is far from solved, we are proposing to let them decide how to optimize, possibly losing control.

This lack of control is simply frightening as the possibility of these models going rogue is far from being zero.

On the flip side, this could open up a new world of possibilities and breakthroughs that our limited minds cannot comprehend, and thus could not achieve, without these superhuman models.

So the question is… where do we draw the line?

🫡 Key contributions 🫡

  • Meta’s Self-Rewarding models are a new framework for training LLMs where the model itself dictates its rewards, iteratively improving over time while humans remain out of the loop.

  • It could be the gateway for training superhuman LLMs, a feat as exciting as it is scary.

👾 Best news of the week 👾

🎥 Google shows off its new model Lumiere

🥇 Leaders 🥇

The Threat of a Global AI Tax

A Financial Times piece by Marietje Schaake, international policy director at Stanford University’s Cyber Policy Center and special adviser to the European Commission, on the need to impose a global tax on AI and AI companies, has made headlines across the world.

Even though the idea of taxing machines is decades old, governments around the world are finally considering this option.

However, such a tax not only seems extremely premature, it might also be too complicated to impose.

Bad timing or execution could cripple innovation on a technology with the potential of curing all diseases, eliminating the need for jobs, and ultimately elevating human civilization to new heights.

On the flip side, AI’s threat to government revenues is already apparent, and countries around the world could soon see their piggy banks emptied because of AI.

Today, we are deep-diving into how this tax could take place, whether AI is already impacting our economy, and the consequences, both positive and negative, of governments taking this extremely risky step.

So first things first, are we already feeling the effects of AI on the economy?

AI and the Current Economy

For the last year or so, AI has been top of mind for consumers, governments, and investors alike around the world.

And even though today it’s all purely based on hype, the effects are already clear.

The ‘Magnificent AI Seven’

If we look at US stock growth over the last year, it’s never been so unbalanced.

As a recent study by Goldman Sachs shows, Apple, Microsoft, Amazon, Alphabet (Google), Nvidia, Tesla, and Meta account for 29% of the total market cap of the S&P 500, the index that tracks roughly 500 of the largest companies in the US market.

Also, they have grown 71% over the year, compared to just 6% for the remaining 493 corporations.

Source: Goldman

And the reason for this is that, as Defiance ETF’s CEO explained, these 7 companies are all considered AI companies and thus have ridden the hype horse.

But the stock market isn’t the only place where AI is already disrupting the economy.

AI and Layoffs Are a Thing Already

As a survey by ResumeBuilder showed, 44% of CEOs expect AI to replace workers this year.

But the replacement is already taking place, as 37% of companies already leveraging AI acknowledged they replaced employees last year with AI because “they were no longer needed.”

From my personal experience, for instance, I have already seen big corporations openly discussing completely automating certain customer support processes with AI.

On top of that, a joint study by Harvard and the Boston Consulting Group (BCG) found that consultants using ChatGPT completed their tasks 25% faster and produced work rated 40% higher in quality.

And as open-source models are becoming much better and are also easier to align thanks to breakthroughs like Meta’s self-rewarding models, the temptation to fully embrace LLMs will be hard for companies to resist during 2024.

Thus, we are only getting started. But if we look at the mid-to-long term, things get even more wild.

The Great Displacement

There’s no industry or job safe from AI, period. All of us, no matter our background or expertise, are exposed to some extent to AI.

  • McKinsey estimates 12 million occupational shifts in the US alone by 2030, and expects that up to 30% of currently worked hours could be automated by then.

  • OpenAI estimates that 80% of workers could see at least 10% of their tasks affected, with 19% seeing at least 50% of their tasks affected.

Interestingly though, the higher your wage, the more exposed you are according to human and GPT-4 ratings, so this is not a blue-collar wipeout as some people put it.

Source: OpenAI

But even though AI is still portrayed as a Copilot technology, it still has great potential to massively decrease job demand... even potentially creating a jobless society.

In fact, people like Elon Musk and Sam Altman have openly discussed the real possibility of having to introduce a universal basic income, as job displacement could be so large and so quick that most people would be ‘de facto’ out of the market.

And less job demand is, put mildly, a huge problem for governments, which could force them to take desperate measures very soon.
