Microsoft's Orca 2 and Q*, the Secret that Almost Killed OpenAI
🏝 TheTechOasis 🏝
Breaking down the most advanced AI systems in the world to prepare you for your future.
5-minute weekly reads.
AI Research of the Week: Orca 2, an Amazing New Type of LLM
Leaders: Q*, the Secret Model that Almost Killed OpenAI
🤯 AI Research of the week 🤯
As we discussed last week, Small Language Models (SLMs) are all the rage right now.
Now Microsoft has launched Orca 2, the new version of its SLM crown jewel, creating a new type of language model: the Cautious Reasoner.
It sets a new bar in the AI industry, beating models up to 10 times larger on highly complex reasoning tasks.
Microsoft has also given a clear view of its AI strategy, along with invaluable insights into the complex world of Transformer learning.
Today, we are going to dive deep into how they created this new paradigm.
The Imitation Game
When Microsoft presented Orca's first version, the first open-source model that was truly at the level of GPT-3.5, the AI industry finally started paying attention to smaller models.
Today, Orca is not only viewed as a critical innovation but also as a centerpiece of Microsoft's strategy. It is rumored that the LLM behind Microsoft's Copilots, the talk of the town in the industry, is not ChatGPT but Orca, due to the insane costs of running 100-billion-plus-parameter models.
Microsoft's premise is simple: if a model can deliver 90% of a larger model's capabilities at a tenth of the cost, that is the route to take.
But you may wonder: how are we building models ten times smaller than the big guys that still retain most of their capabilities?
And the answer to that is distillation.
Teach me the ways, ChatGPT
By far the most common way of training small language models is distillation, the imitation game.
First, there's an inescapable truth: as language models grow bigger, they get better, learning new capabilities along the way.
And since neural networks can learn almost anything given sufficient data, researchers asked: what if, instead of training a model to learn to model language by itself, we taught it to copy another model?
This is what we call distillation: a training process in which a student model learns to imitate a teacher model by learning the data distribution of its responses.
But how does this work?
Take LLMs like ChatGPT: given a text sequence, they output a probability distribution over the next token.
In layman's terms, they pick the word or subword with the highest probability of being 'adequate' out of the LLM's complete vocabulary.
For instance, for the text sequence "The UK's capital is London", we mask (hide) the word "London", and the model's objective is to assign the highest probability, among all the words in its vocabulary, to "London".
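The next-token idea can be sketched with a toy vocabulary. The logits below are invented for illustration; a real LLM scores its entire vocabulary of tens of thousands of tokens:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution that sums to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical raw scores a model might assign to candidate next tokens
# for the masked sequence "The UK's capital is ___". These numbers are
# made up for illustration only.
logits = {"London": 8.1, "Paris": 3.2, "Madrid": 2.9, "banana": -4.0}

probs = softmax(logits)
prediction = max(probs, key=probs.get)  # the token with the highest probability
```

During pretraining, the model is penalized whenever the probability it assigns to the true masked word is low, which gradually pushes distributions like this one toward the correct token.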
But in the distillation process, we add an extra step. The student not only has to learn this very same procedure, but also needs to learn to match its outputs to the teacher’s.
Put simply, by collecting several examples like the one above from the teacher, the student learns to model language and to imitate the teacher.
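A minimal sketch of that matching objective, assuming we compare the teacher's and student's full next-token distributions with a KL divergence (real pipelines typically minimize a cross-entropy or KL term against the teacher's outputs; the numbers here are made up):

```python
import math

def kl_divergence(teacher, student):
    """KL(teacher || student): how far the student's next-token
    distribution is from the teacher's, summed over the vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

# Toy next-token distributions over a 4-word vocabulary (invented values).
teacher = [0.85, 0.10, 0.04, 0.01]        # teacher is confident in token 0
student_before = [0.40, 0.30, 0.20, 0.10]  # student before distillation
student_after = [0.80, 0.12, 0.05, 0.03]   # student after training on teacher outputs

# Distillation training minimizes this divergence on many examples,
# pulling the student's distribution toward the teacher's.
assert kl_divergence(teacher, student_after) < kl_divergence(teacher, student_before)
```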
But there’s a catch.
In standard distillation, the student learns to imitate style and eloquence, but when prompted with reasoning tasks, it pales in comparison.
It's like memorizing the answer to a math problem: if you don't understand it, the moment things get complex, you fail.
But then came Orca’s first version.
To solve this problem, Microsoft introduced explanation tuning.
Here, when prompting the teacher model to create the distillation dataset, researchers asked the teacher to give full details of its reasoning.
The key addition is a system instruction asking the teacher model (GPT-4 in this case) to explain its reasoning.
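A sketch of what one such explanation-tuning record might look like; the function name and the exact instruction wording are hypothetical illustrations, not Microsoft's actual prompts:

```python
def build_explanation_record(question, teacher_answer):
    """Assemble one training example in the explanation-tuning style:
    the teacher is instructed to show its reasoning, not just answer."""
    return {
        # System instruction asking the teacher to expose its reasoning.
        "system": ("You are a helpful assistant. Think step by step and "
                   "justify your answer before giving it."),
        "user": question,
        "assistant": teacher_answer,  # the teacher's detailed, explained response
    }

record = build_explanation_record(
    "If a train travels 120 km in 2 hours, what is its average speed?",
    "Average speed is distance divided by time: 120 km / 2 h = 60 km/h. "
    "So the answer is 60 km/h.",
)
```

The student is then fine-tuned on many such records, so it sees not just answers but the reasoning that produced them.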
And with this, Orca became the first model to beat a large frontier model, GPT-3.5, in almost every aspect… despite being 10 times smaller.
But this process is still "sub-optimal", as the same researchers admit, meaning the model still struggles with certain tasks.
To solve this, they created Orca 2, the cautious reasoner.
Think before you type
Even though explanation tuning helped Orca imitate GPT-4's reasoning capabilities, it still fell short in several respects.
So this time, researchers didn't simply teach Orca to reason like GPT-4; they taught it to approach problem-solving the same way, inducing this thought process with prompt erasing.
Masking the thought process
LMs are heavily influenced by the prompt they are given and the approach it suggests for solving a problem; even the best models can get a problem right or wrong depending on the strategy they use.
Consequently, if we want a much smaller model to improve, we need a way to transfer the teacher's ability to choose the correct problem-solving strategy to the student.
Therefore, when creating the synthetic dataset of GPT-4 responses to train the new Orca, researchers asked it to explain its reasoning and to use the best problem-solving strategy for each task (step-by-step, explain-then-answer, direct answer, and so on).
Hence, they assembled a huge dataset of GPT-4 problem-solving examples with different strategies, using the following structure:
System instruction: how the model should behave when answering (the strategy)
User prompt: the task to solve
Teacher response: GPT-4's detailed answer, following that strategy
But this time, during training, the researchers hid the system instruction from the student, a technique they call prompt erasing, "forcing" the model to figure out the strategy by itself.
In other words, as the student learns to solve problems whose highly complex reasoning processes are never explained to it, it implicitly learns to "think" the same way.
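The idea can be sketched as follows; the record format and the `erase_prompt` helper are hypothetical illustrations, not Orca 2's actual pipeline:

```python
def erase_prompt(record):
    """Prompt erasing: drop the system instruction (the strategy) from a
    teacher example, so the student sees only the task and the detailed
    answer, and must infer the strategy on its own."""
    return {"user": record["user"], "assistant": record["assistant"]}

# A teacher example whose system instruction names the strategy to use.
teacher_example = {
    "system": "Solve this step by step, then state the final answer.",
    "user": "What is 17 * 24?",
    "assistant": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
}

# The student trains on the erased version: the strategy is hidden,
# but the answer still follows it, so the strategy must be inferred.
student_example = erase_prompt(teacher_example)
assert "system" not in student_example
```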
Unsurprisingly, Orca-2-13B outperforms similar-sized models like LLaMA-2-Chat-13B and WizardLM-13B on all reasoning tasks while competing closely with much larger models like LLaMA-2-Chat-70B and WizardLM-70B.
And since Orca 2's base model is LLaMA 2, the same base as its counterparts, the cautious-reasoning training procedure itself is what makes the difference.
So what now, Microsoft?
With Orca 2, not only are we witnessing the new frontier for open-source models, but Microsoft is also being very clear about its strategy:
Use OpenAI to develop better capabilities over time with ever-larger models, and use those capabilities to train smaller models for massive deployment.
Thankfully, they are also helping the world better understand LM training while taking us closer than ever to the next leap in AI, System 2 thinking (deliberate, analytical, conscious thinking)... something our friends at OpenAI might have achieved already with Q*, described in this week's Leaders issue below...
Orca 2 represents a leap for SLMs and distillation procedures, teaching a smaller model to strategize about its answers.
It is vastly superior to any similar model, and even to much bigger ones, setting a new frontier for open-source SLMs.
🔮 Practical implications 🔮
SLMs are the future of AI and will be embedded into every piece of software we know.
Although not proven, Orca 2 is probably already deployed as the underlying LM behind Microsoft's Copilots.
👾 Best news of the week 👾
😍 Pika labs sets a new frontier for text-to-video models with this insane video
🥳 Andrej Karpathy’s busy intro to LLMs video
🎙 Meta unveils Seamless Communication, the ultimate AI translator
📖 DeepLearning.ai unveils a short free course to improve your RAG solutions
🥇 Leaders 🥇
Q*, the Secret Model that Almost Killed OpenAI
Q* (pronounced Q-star) is, undoubtedly, the news of the week… or the year.
As Jim Fan put it, never in history has an AI model we know so little about (but more than it seems) generated as much noise.
From people saying it is a threat to humanity, to those claiming it to be the algorithmic breakthrough that will finally bring AGI to us, everyone seems to be talking about it.
In the meantime, while GPT-4 and DALL-E 3 took the spotlight, OpenAI has been quietly releasing several research papers that hinted at this discovery.
A discovery that, according to Reuters, almost killed the $90 billion company.
But is it a threat to humankind, or the marvel we’ve been waiting for?
The Great Discovery
As you probably know by now, OpenAI almost crumbled to the ground in a matter of days after Sam Altman, the CEO who turned the company into a $90 billion behemoth, was ousted for not being "candid" in his communications with the Board.
But after almost every employee in the company promised to leave with Sam unless he returned, the Board brought him back.
But why did the Board fire him?
According to Reuters, the Board received a letter about a new ‘breakthrough’ achieved internally named Q* that scared them to death and forced the dismissal.
Although OpenAI officially claims the letter never existed, Sam himself has acknowledged the model's existence. When asked about it, he simply said, "No particular comment on that unfortunate leak."
Thus, to understand what this model is and why it is raising so many questions, we need to go back a few years to one of the most important discoveries in human history.