
Microsoft's Orca2 and Q*, the Secret that Almost Killed OpenAI

šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Orca2, an Amazing New Type of LLM

  • Leaders: Q*, the Secret Model that Almost Killed OpenAI

AI Research of the week

As we discussed last week, Small Language Models (SLMs) are all the rage right now.

Now, Microsoft has launched Orca 2, the new version of its SLM crown jewel, creating a new type of language model: the Cautious Reasoner.

It sets a new bar for the AI industry, beating models up to ten times larger on highly complex reasoning tasks.

Along the way, Microsoft has given us a clear view of its AI strategy, as well as invaluable insights into the complex world of Transformer learning.

And today we are going to dive deep into how they created this new paradigm.

The Imitation Game

When Microsoft presented Orca's first version, the first open-source model truly at the level of GPT-3.5, the AI industry finally started paying attention to smaller models.

Today, Orca is not only viewed as a critical innovation; it is also a centerpiece of Microsoft's strategy. The LLM running behind Microsoft's Copilots, the talk of the town in the industry, is rumored to be not ChatGPT but Orca, due to the insane cost of running models with over 100 billion parameters.

Microsoft's premise is simple: if a model gives us 90% of the larger model's capabilities at a tenth of the cost, we are taking that route.

But you may wonder: how do we build models ten times smaller than the big guys while keeping most of their capabilities?

And the answer to that is distillation.

Teach me the ways, ChatGPT

By far the most common way of training small language models is distillation, the imitation game.

First and foremost, there's an inescapable truth: as language models grow bigger, they get better, learning new capabilities along the way.

Research from Google and Anthropic proves that LLMs show no signs of saturation; in other words, no one has reached the point at which LLMs stop learning as they grow in size.

Consequently, since neural networks can learn almost anything given sufficient data, researchers asked themselves: what if, instead of training a model to learn to model language by itself, we taught it to copy another model?

This is what we call distillation: the training process in which a student model learns to imitate a teacher model by learning the distribution of the teacher's responses.

But how does this work?

Take LLMs like ChatGPT as an example: they output a probability distribution over the next token.

In layman's terms, given a text sequence, they pick the word or subword, out of their complete vocabulary, with the highest probability of being 'adequate' next.

For instance, for the text sequence "The UK's capital is London", we mask (hide) the word "London", and the model's objective is to assign the highest probability, among all the words in its vocabulary, to "London".
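To see this in action, here is a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in for any causal LLM (ChatGPT's weights are not public, so the model choice here is an assumption for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in here for any causal LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The UK's capital is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Probability distribution over the entire vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob.item():.3f}")
# A well-trained model should give ' London' a high probability here.
```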

But in the distillation process, we add an extra step: the student not only has to learn this very same procedure, but also to match its outputs to the teacher's.

Put simply, by collecting many examples like the one above from the teacher, the student learns both to model language and to imitate the teacher.
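For the curious, this is roughly what the textbook distillation objective looks like in PyTorch. Note this is the classic logit-matching version; Orca-style distillation actually fine-tunes on the teacher's generated text, since API models like GPT-4 don't expose their full output distribution. The hyperparameter values are illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Textbook soft-label distillation loss (a sketch, not Orca's recipe).

    The KL term pushes the student's next-token distribution toward the
    teacher's; the cross-entropy term keeps it anchored to the actual
    ground-truth tokens."""
    # Temperature softens both distributions so the student learns the
    # teacher's relative preferences, not just its single top pick.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    return alpha * kl + (1 - alpha) * ce
```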

Brilliant!

But thereā€™s a catch.

In standard distillation, the student learns to imitate style and eloquence, but when prompted with reasoning tasks, it pales in comparison.

It's like learning the answer to a math problem by heart: if you don't understand it and simply memorize it, you fail the moment things get complex.

But then came Orca's first version.

Imitating reasoning

To solve this problem, Microsoft introduced explanation tuning.

Here, when prompting the teacher model to create the distillation dataset, researchers asked the teacher to give full details of its reasoning.

The difference from standard distillation is that researchers added a system instruction asking the teacher model (GPT-4 in this case) to explain its reasoning.
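To make this concrete, here is a hypothetical example of what a single explanation-tuning record might look like. The instruction wording and the template tags are illustrative assumptions, not Microsoft's exact format:

```python
# A hypothetical explanation-tuning record (illustrative, not from the paper).
example = {
    "system_instruction": (
        "You are a helpful assistant. Think step by step and justify "
        "your answer before giving it."
    ),
    "user_prompt": (
        "If a train travels 120 km in 1.5 hours, what is its average speed?"
    ),
    # The teacher's (GPT-4's) full response, reasoning included, becomes
    # the target text the student is trained to reproduce.
    "teacher_output": (
        "Average speed is distance divided by time: 120 km / 1.5 h = 80 km/h. "
        "Therefore, the train's average speed is 80 km/h."
    ),
}

# Standard supervised fine-tuning then trains the student on this text.
training_text = (
    f"<system>{example['system_instruction']}</system>\n"
    f"<user>{example['user_prompt']}</user>\n"
    f"<assistant>{example['teacher_output']}</assistant>"
)
```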

And with this, Orca became the first model to beat a large frontier model, GPT-3.5, in almost every aspect… despite being ten times smaller.

But this process is still "sub-optimal", as the very same researchers claim: the model continues to struggle with certain tasks.

To solve this, they created Orca 2, the cautious reasoner.

Think before you type

Even though explanation tuning helped Orca imitate GPT-4's reasoning capabilities, it still didn't come close in several aspects.

So this time, researchers didn't simply teach Orca to reason like GPT-4; they taught it to approach problem-solving in the same way, inducing this thought process with prompt erasing.

Masking the thought process

LMs are heavily influenced by the prompt they are given and by the approach taken to solve a problem. Even the best models can get a problem right or wrong depending on the strategy they use.

Consequently, if we want a much smaller model to improve, we need a way to transfer the teacher's capacity for choosing the correct problem-solving strategy to the student.

Therefore, when creating the synthetic dataset of GPT-4 responses to train the new Orca, they asked it to explain its answers and to use the best problem-solving strategy for each task (step-by-step, explain-then-answer, direct answer, and so on).

Hence, they assembled a huge dataset of GPT-4 problem-solving examples with different strategies, using the following structure:

  • System instruction: How the model needs to behave when answering (the strategy)

  • User Prompt

  • Output

[Image in the original post: an example system instruction.]

But this time, during training, the researchers hid the system instruction from the student, a technique known as prompt erasing, "forcing" the model to figure out the strategy by itself.

In other words, as the student learns to solve problems whose highly complex reasoning processes are never explained to it, it implicitly learns to "think" the same way.
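Here is a minimal sketch of what prompt erasing could look like in code. The strategy instructions, helper function, and formatting tags are all illustrative assumptions, not taken from the Orca 2 paper:

```python
# A minimal sketch of prompt erasing (illustrative, not the paper's code).
STRATEGIES = {
    "step-by-step": "Solve the problem step by step, showing all your work.",
    "explain-then-answer": "First recall the relevant facts, then answer.",
    "direct-answer": "Answer directly and concisely.",
}

def build_student_example(record):
    """Turn a teacher record into a student training pair.

    The teacher saw the strategy-specific system instruction when it
    generated record['teacher_output']; the student never sees it, so it
    must learn to infer the right strategy from the task alone."""
    student_input = f"<user>{record['user_prompt']}</user>\n<assistant>"
    target = record["teacher_output"]  # still reflects the erased strategy
    return student_input, target

record = {
    "system_instruction": STRATEGIES["step-by-step"],  # erased for training
    "user_prompt": "A farmer has 17 sheep. All but 9 run away. How many remain?",
    "teacher_output": "All but 9 ran away, so exactly 9 sheep remain.",
}
student_input, target = build_student_example(record)
```

The key design choice: the teacher's answer still carries the structure of the chosen strategy, so by learning to predict it without ever seeing the instruction, the student internalizes when to use which approach.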

Unsurprisingly, Orca2-13B outperforms similar-sized models like LLaMA-2-Chat-13B and WizardLM-13B in all reasoning tasks while competing closely with larger models like LLaMA-2-Chat-70B and WizardLM-70B.

For reference, WizardLM-70B is considered the best open-source model and the sixth-best overall according to LMSYS's Chatbot Arena leaderboard.

Also, Orca 2's base model is LLaMA-2, proving that, with the same base model as its counterparts, the cautious-reasoning training procedure is unequivocally superior.

So what now, Microsoft?

With Orca 2, not only are we witnessing a new frontier for open-source models, but Microsoft is also being very clear about its strategy:

Use OpenAI to develop better capabilities over time with ever-larger models, and use these new capabilities to train smaller models for massive deployment.

Thankfully, they are also helping the world better understand LM training while taking us closer than ever to the next leap in AI, System 2 thinking: deliberate, analytical, and conscious thinking… something our friends at OpenAI might have achieved already with Q*, described in this week's Leaders issue below...

Key contributions

  • Orca 2 represents a leap for SLMs and distillation procedures, teaching a smaller model to strategize about its answers.

  • It is vastly superior to any other model of similar size, and to many much bigger ones, setting a new frontier for open-source SLMs.

Practical implications

  • SLMs are the future of AI and will be embedded into every kind of software.

  • Although not confirmed, Orca 2 is probably already deployed as the underlying LM behind Microsoft's Copilots.

Best news of the week

šŸ˜ Pika labs sets a new frontier for text-to-video models with this insane video

šŸ„³ Andrej Karpathyā€™s busy intro to LLMs video

šŸŽ™ Meta unveils Seamless Communication, the ultimate AI translator

šŸ“– DeepLearning.ai unveils a short free course to improve your RAG solutions

Leaders

Q*, the Secret Model that Almost Killed OpenAI

Q* (pronounced Q-star) is, undoubtedly, the news of the week… or the year.

As Jim Fan put it, never in history has an AI model about which we know so little (but more than it seems) generated so much noise.

From people saying it is a threat to humanity to those claiming it is the algorithmic breakthrough that will finally bring us AGI, everyone seems to be talking about it.

In the meantime, while GPT-4 and DALL·E 3 took the spotlight, OpenAI has been quietly releasing several research papers that hinted at this discovery.

A discovery that, according to Reuters, almost killed the $90 billion company.

But is it a threat to humankind, or the marvel weā€™ve been waiting for?

The Great Discovery 

As you probably already know, OpenAI almost crumbled to the ground in a matter of days after Sam Altman, the CEO who turned the company into a $90 billion behemoth, was ousted for not being "candid" in his communications with the Board.

Unconditional 'love'

But after almost every employee in the company promised to leave with Sam unless he returned, the Board brought him back.

This act of 'love' by employees looks much more like a panic attack over the idea of losing their secondary-sale exit at a $90 billion valuation; employees stood to lose millions.

But why did the Board fire him?

According to Reuters, the Board received a letter about a new internal 'breakthrough', named Q*, that scared them to death and forced the dismissal.

Although OpenAI officially claims that the letter never existed, Sam himself has acknowledged the model's existence; when asked about it, he simply said, "No particular comment on that unfortunate leak."

Thus, to understand what this model is and why it is raising so many questions, we need to go back a few years to one of the most important discoveries in human history.
