Microsoft Changes the Game & ORPO, The New Training Standard?

🏝 TheTechOasis 🏝

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

🚨 Week’s Update 🚨

From war to healthcare, there’s a lot to unpack this week.

To start, the US Government has announced a world first in the aerospace industry: an autonomous AI-controlled F-16 fighter aircraft. The fighter was capable of engaging in visual-range combat scenarios, known as ‘dogfighting’, against a human pilot.

🏝️ This is just another step toward AI-driven wars, after Iran’s attack on Israel relied heavily on drones. AI should minimize human casualties in open war, but it also opens the Pandora’s box of how dangerous AI can be.

This has reminded me of Palantir’s amazing demo on AI-assisted warfare.

Moving on, OpenAI has launched a new set of enterprise features, like project-based control, the famous Batch API for asynchronous inference (for use cases where you don’t need an immediate response and can accept a 24-hour window in exchange for a very welcome discount), or streaming-based responses.

🏝️ OpenAI’s move comes in handy as their technological offering is not the best anymore, with Claude 3 pushing frontier models forward, and Meta’s Llama 3 or Microsoft Phi-3 (more on the latter below) pushing the boundaries of what Small Language Models (SLMs) can do, cost-effective alternatives to the big guys.

In the same fashion, Perplexity has just extended its enterprise offering to revamp enterprise search with AI. They presented this announcement with a stacked set of enterprise customers ranging from Stripe to the fricking Cleveland Cavaliers. Read the CEO’s post for more details.

At the product level, few products in Generative AI history have looked as promising as Adobe’s Premiere Pro (video included).

Excitingly, it’s the first successful embedding of OpenAI’s text-to-video model Sora into a fully-fledged product, to extend videos with AI-generated frames, but also includes other impressive features like image inpainting (object add/removal). It will offer other models like Pika.

They also announced their newest image generator model Adobe Firefly. The company seems to be on a complete tear lately!

In market land, Chinese investors are going crazy for SenseTime shares, which jumped upwards of 30% after the company presented its SenseNova 5.0 model and the CEO claimed ‘it was better than GPT-4’.

This is a stark reminder for the US that China is dead serious about their AI efforts. In fact, as the New York Times recently published, they are actually beating the US in the most critical metric: talent.

The US has one critical advantage though, capital.

Moving on, in parallel to their release of Q1 2024 results, Tesla has teased what the future autonomous ride-hailing app they are working on will look like.

As the EV market seems to be on a race to the bottom price-wise, Tesla seems to be betting its future on AI, with the Robotaxi announcement and the more than obvious merger between Tesla’s Full Self-Driving Mode software and xAI’s Grok 1.5V capabilities, as we discussed last week.

Lastly, in yet another chapter of ‘AI will take my job’, OpenAI’s GPT-4 is already almost as good as expert ophthalmologists in deciding how to treat patients with eye problems, while considerably outperforming junior and trainee doctors.

AI has great potential to make the US healthcare system, broken as few things in this world, a much more affordable industry for most Americans.

And in Europe, where healthcare is most often a public (free) service, it can also be an extremely positive factor, as healthcare spending accounts for close to a double-digit share of each country’s GDP, on average.

🧐 Research to Pay Attention to 🧐

  • Phi-3, An SLM as Good as ChatGPT?

  • ORPO, the New LLM Training Standard?

📈 Phi-3, The Great Enterprise Shift 📈

Tiny but mighty. That’s how Microsoft has summarized their brand new generation of Phi models, announced yesterday.

They have some serious claims, like the fact that their new minute models are reaching the performance of ChatGPT-3.5 or Mixtral 8×7B despite being so small they can be deployed on a smartphone.

This news is just a new stepping stone on AI’s path to pragmatism, opening an era where on-device LLMs and AI edge computing are finally serious options for enterprises.

Also, it seems to break a historical AI pattern.

Who Cares About Scaling Laws

In frontier AI, one law rules over everything else. In fact, it’s the law underpinning a huge chunk of the 50 billion dollars invested in AI just last year alone:

Scaling laws. Or, in less jargon-like terms, the bigger, the better.

No end in sight

Years after the discovery of the Transformer, the architecture underpinning most cutting-edge models, one principle continues to be true:

As models get bigger, they get ‘smarter’.

In more technical terms, the average perplexity, the measurement of how uncertain a model is about its next-word predictions, continues to drop as models get bigger.
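To make the metric concrete, here is a minimal sketch of how perplexity can be computed from the probabilities a model gave to the correct next words. The function name and the toy probabilities are made up for illustration; they are not taken from any real model.

```python
import math

def perplexity(correct_token_probs):
    """Perplexity = exp(average negative log-probability of the correct tokens)."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to the correct next word at each step.
confident = perplexity([0.9, 0.8, 0.95])       # model is fairly sure -> low perplexity
unsure = perplexity([0.2, 0.1, 0.25])          # model is hesitant -> high perplexity
uniform = perplexity([0.25, 0.25, 0.25, 0.25]) # guessing among 4 equally likely words

print(confident, unsure, uniform)
```

A handy way to read the number: a perplexity of 4 means the model is, on average, as unsure as if it were picking between 4 equally likely words.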

And this is unequivocally true. Yet, even though that metric alone can justify the immense investments from labs like OpenAI, Anthropic, or Google, it begs the question:

Do we actually need to always follow this law? Well, as Microsoft has proven, hard no.

Striking the Right Balance

One thing is to try to create Artificial General Intelligence, or AGI, and another one is to put Generative AI to use.

Most consumer or enterprise use cases, especially the latter, do not require state-of-the-art AI to be effective. When deploying AI, you want to find the right balance between a model that is great but at the same time pragmatic.

For example, if I intend to use an LLM as a voice assistant from my smartphone, an LLM that will come into touch with some of my most private data, I will prefer it to be “on-device”, aka offline.

For that reason, Small Language Models, or SLMs, have become the most attractive field of development in AI when it comes to ‘putting AI to use’.

And amongst the incumbents, few stand to benefit more from this trend than Microsoft.

The Great Small Model

The Phi-3 family has yielded quite amazing results. One could claim they are extraordinary.

For starters, the smallest model in the family, Phi-3 Mini with 3.8 billion parameters, yields better performance on MMLU than Llama 3 8B-Instruct despite being half the size (I guess Meta has lost its crown in the SLM field pretty rapidly) and inches close to Mixtral 8×7B, a 45 billion parameter model more than 10 times larger.

The MMLU (Massive Multitask Language Understanding) benchmark evaluates the model's knowledge and reasoning through multiple-choice questions spanning dozens of subjects, from math and law to medicine and history.

It’s one of the most used benchmarks to compare models.

But if we move into the 7B and 14B Phi models, things get even crazier, as the former is at the ChatGPT-3.5 level (a model that would have at least 175 billion parameters, probably many more) and the latter is objectively superior.

But how can we explain these amazing results?

The Power of Data

Simply put, they went against scaling laws.

With scaling laws, as size goes before anything else, there’s basically no data quality assurance at the pre-training stage. In other words, they feed models every data point they find.

But with the Phi models, Microsoft has prioritized data quality, theorizing that when training only with extremely curated data, the results obtained are similar to what larger models obtain with uncurated data.

That way, instead of building larger and larger models, with better-quality data you can obtain outsized results at a much smaller scale.

What doesn’t change, no matter the approach, is the amount of data. In the case of the 3.8 billion model, they trained it with 3.3 trillion tokens. This is 16 times what is considered ‘compute optimal’ for a 10B model and in line with Meta’s insane data-intensive training for Llama 3 (15 trillion in their case).
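Where does that ‘16 times’ come from? A quick back-of-the-envelope sketch, assuming the commonly cited rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. That constant is an assumption of this example, not a figure from Microsoft's report.

```python
# Assumption: ~20 training tokens per parameter is 'compute optimal'.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params):
    """Rough compute-optimal token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

phi3_mini_tokens = 3.3e12  # 3.3 trillion tokens, as stated above

# Versus the compute-optimal budget of a 10B model (200 billion tokens).
ratio_vs_10b = phi3_mini_tokens / compute_optimal_tokens(10e9)

# Versus the compute-optimal budget of Phi-3 Mini's own size (3.8B).
ratio_vs_own_size = phi3_mini_tokens / compute_optimal_tokens(3.8e9)

print(round(ratio_vs_10b, 1), round(ratio_vs_own_size, 1))
```

Under that assumption, the ratio against a 10B model lands around 16, and it is even larger when measured against Phi-3 Mini's own 3.8 billion parameters.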

What We Think

Microsoft’s Phi-3 results aren’t just promising, they are simply incredible.

Just close to a year and a half after the release of ChatGPT-3.5 (the original ChatGPT), we have models at the 3-15 billion parameter range at the same level or better while being at least an order of magnitude smaller.

On the flip side, while models might be getting smaller when it comes to deploying AI, data sizes are growing exponentially, with several orders of magnitude more training data being used compared to a few years ago.

Among those who stand to benefit the most are Microsoft, Apple, and Meta, all of which have a considerable share of their business dependent on edge hardware like laptops or smartphones.

Specifically, Apple, a hardware company huge on data privacy, seems to be salivating at the idea that AI might be shifting to a more practical form, where Language Models become prominent “on-device” software, having just released a new set of SLMs of its own, OpenELM.

📚 Sponsor of the Week 📚

85% of all AI Projects Fail, but AE Studio Delivers

If you have a big idea and think AI should be part of it, meet AE.

We’re a development, data science and design studio working with founders and execs on custom software solutions. We turn AI/ML ideas into realities–from chatbots to NLP and more.

Tell us about your visionary concept or work challenge and we’ll make it real. The secret to our success is treating your project as if it were our own startup.

🏋🏽‍♀️ ORPO, the New LLM Training Standard? 🏋🏽‍♀️

Coming fresh out of South Korea, a team of researchers has presented a new training method for Large Language Models, named ORPO, that offers increased efficiency in terms of computation and, importantly, seems to create better-performing models.

If true, we could see a massive shift in the way we train LLMs. But first, how do we actually train them?

From Two to Three, and Back to Two

For starters, it’s important to note that LLMs are neural networks (NNs) and, for that reason, they have a very specific training method: glorified trial and error.

How NNs Learn

To train an LLM, we compare its responses to the ground truth (the correct answer). This gives us a ‘signal’ on how well the NN is performing, known as the loss, so that we can adapt its parameters to minimize that loss.

In the case of LLMs, we want them to predict the next word in a sequence. Thus, for a given prediction the model outputs the full list of possible answers (words) each with its assigned probability.

Then, we measure the loss using the cross-entropy function below, where y_i is a boolean with a 1 over the ground-truth word and a 0 on the rest, and ŷ_i is the probability the model assigned to word i.

$$L_{CE} = -\sum_i y_i \log(\hat{y}_i)$$

Simply put, this function just looks at the probability the model gave to the correct word and forgets about the rest. Thus, it maximizes the probability the model assigns to the right word.

Consequently, if we tune the parameters of the model to minimize this loss, we are implicitly training the model to become an accurate next-word predictor.
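As a minimal sketch of the idea, here is cross-entropy computed by hand over a toy three-word vocabulary. The vocabulary, the probabilities, and the function name are all made up for illustration.

```python
import math

def cross_entropy(one_hot, predicted_probs):
    """-sum(y_i * log(yhat_i)); only the ground-truth word (y_i = 1) contributes."""
    return -sum(math.log(p) for y, p in zip(one_hot, predicted_probs) if y == 1)

# Toy vocabulary: ["cat", "dog", "car"]; the ground-truth next word is "dog".
target = [0, 1, 0]

weak_prediction = [0.5, 0.3, 0.2]    # only 30% on the correct word
strong_prediction = [0.1, 0.8, 0.1]  # 80% on the correct word

weak_loss = cross_entropy(target, weak_prediction)
strong_loss = cross_entropy(target, strong_prediction)

print(weak_loss, strong_loss)
```

Notice that the probabilities given to "cat" and "car" never enter the calculation. The loss only drops when the model puts more probability on "dog".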

This single equation above has dominated the entire industry for years now, as all LLMs, and I mean all, are trained by optimizing the model against this function.

Well, ORPO’s researchers are calling bullshit on this approach. But before we know why, we need to understand the whole process.

The LLM Training Pipeline

Training an LLM is usually described as a two-step process (pre-training and fine-tuning), which is actually three steps once you split fine-tuning in two, and sometimes four, and which ORPO aims to bring back to two.

Sounds like a tongue-twister but it’s not. These are:

  1. Pre-training stage: The model is fed with all the data that is humanly possible to ingest. Here the model learns to accurately predict the next word in a sequence, but it’s incapable of following instructions.

  2. Fine-tuning stage - Supervised Fine-tuning: Here, the model is trained using a supervised training method to follow instructions, which is why you will often see models at this stage with names such as “InstructGPT” (OpenAI).

  3. Fine-tuning stage - Alignment: Here the model undergoes safety training. While remaining helpful (step 2), it learns to not respond when the response can be dangerous. Models are trained using Reinforcement Learning from Human Feedback (RLHF), DPO, or both, as Llama 3 researchers did.

In the third stage, the function changes. As mentioned, in steps 1 and 2 we want the model to learn to predict the next word. But in step 3, we want the model to improve decision-making. For that, the model is given two accurate responses to a prompt, with one being ‘better’ than the other one, and the model has to learn to choose the correct one.

To optimize against pair-wise responses to a given prompt, we usually use the Bradley-Terry model.
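For intuition, here is a minimal sketch of the Bradley-Terry idea: each response gets a score, and the probability that one is preferred over the other depends only on the difference between the scores. The function names and the scores are hypothetical; real pipelines obtain these scores from a reward model or from the LLM's own probabilities.

```python
import math

def preference_probability(score_chosen, score_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(score_chosen - score_rejected)."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human preference; lower is better."""
    return -math.log(preference_probability(score_chosen, score_rejected))

tie = preference_probability(1.0, 1.0)  # equal scores -> 50/50
agrees = pairwise_loss(2.0, 0.0)        # model scores the preferred response higher
disagrees = pairwise_loss(0.0, 2.0)     # model scores the rejected response higher

print(tie, agrees, disagrees)
```

Minimizing this loss pushes the score of the preferred response up relative to the rejected one.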

Overall, for big models, this takes months and millions of US dollars, sometimes in the 9-figure range. Despite this, the pipeline has remained untouched for years.

Until now.

ORPO, Throwing Steps Out of the Window

From above, it becomes really hard to understand why we divide the fine-tuning stage into steps 2 and 3. Aren’t both modeling the model’s behavior?

Strangely Resilient

While Step 3 is probably more important, Step 2 has been surprisingly resilient. Many teams have tried to discard the Supervised Fine-tuning stage, but the results obtained from keeping it always indicate its key role in the process.

But now, ORPO researchers seem to have figured out a solution: doing steps 2 and 3 together.

If we trace back to the objective function that LLMs are optimized against during training, we realize that while this function indeed pushes the model to assign the highest possible probability to the correct word, it completely forgets about the rejected responses, as the y_i boolean term zeroes out every other response.

To prove it, they monitored both probabilities, the ones assigned to the correct response and to the rejected ones, and saw that both increased over time.

And this… is a problem. While the model’s accuracy improves as it consistently chooses the correct response, it implicitly increases the likelihood of undesired responses too, making the need for an extra stage, alignment, completely inevitable.

So what did they do? Well, they introduced a penalty term.

From 2 Steps to 1

Unlike the previous cross-entropy equation, ORPO proposes a new objective function, where L_SFT is the original cross-entropy we saw earlier, L_OR is a penalty built from the odds ratio between the chosen and rejected responses, and λ weights that penalty:

$$L_{ORPO} = L_{SFT} + \lambda \cdot L_{OR}$$

Hence, if we recall that we train NNs by adapting their parameters so that the term L_ORPO gets smaller over time:

  • While L_SFT increases the likelihood of choosing appropriate responses

  • L_OR decreases the likelihood of choosing bad responses

Intuitively, the odds ratio indicates how much more likely it is for the model to generate the preferable response over the rejected one. Thus, maximizing this ratio (which in loss terms implies minimizing L_OR) has that precise effect.
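To see how the two terms interact, here is a minimal sketch of the ORPO objective on toy numbers. It assumes the formulation from the ORPO paper, where the odds of a response are p / (1 - p) and the penalty is the negative log-sigmoid of the log odds ratio. The token probabilities, function names, and the weight of 0.1 are illustrative choices of this example, not values from the article.

```python
import math

def avg_log_prob(token_probs):
    """Length-normalized log-probability of a whole response."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def log_odds(avg_logp):
    """log(p / (1 - p)), where p is the length-normalized response probability."""
    p = math.exp(avg_logp)
    return math.log(p / (1.0 - p))

def orpo_loss(chosen_token_probs, rejected_token_probs, lam=0.1):
    """Returns (total, l_sft, l_or) with total = l_sft + lam * l_or."""
    logp_chosen = avg_log_prob(chosen_token_probs)
    logp_rejected = avg_log_prob(rejected_token_probs)
    l_sft = -logp_chosen  # usual cross-entropy on the chosen response
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))  # penalty term
    return l_sft + lam * l_or, l_sft, l_or

# Same likelihood for the chosen response, different likelihood for the rejected one.
separated = orpo_loss([0.6, 0.7], [0.2, 0.1])    # rejected response is unlikely
indifferent = orpo_loss([0.6, 0.7], [0.6, 0.7])  # rejected response is just as likely

print(separated, indifferent)
```

In both cases the cross-entropy term is identical, which is exactly the blind spot described earlier. Only the odds-ratio penalty tells the two situations apart, and it is larger when the model finds the rejected response just as likely as the chosen one.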

And with this, researchers conclude that the SFT phase (Step 2) is no longer required, allowing for a much improved computational efficiency and, interestingly, better overall results, as models trained with this method improve over models trained with traditional pipelines.

What We Think

After several years following the same training pipeline, ORPO seems to hit the right notes and presents itself as the first method to eliminate the SFT phase for good.

ORPO training is already available in Hugging Face’s TRL library.

Without a separate SFT phase, you avoid having to instantiate an additional reference model to use as a baseline, which saves a lot of money and time.

What’s more, the outcome seems to be superior, meaning that the penalty term over bad responses is doing an awesome job while implicitly modeling behavior (the actual job of the SFT phase) all in one go.

With ORPO, we might have a new training standard.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]