OpenAI's New Model, the Reflection Drama, & Model Merges
THEWHITEBOX
TLDR;
News on:
OpenAI's Upcoming Model,
the Reflection Drama,
Who's Lying, Mistral or Alibaba?,
Glean's new round,
the SB-1047 bill,
Hume's new emotion model.
[TREND OF THE WEEK] Arcee's SuperNova Model Merge
NEWSREEL
OpenAI's New Model, Out in 2 Weeks
It seems that the wait is finally over. According to The Information (paywalled, but your boy is here to give you all the details), OpenAI intends to release its new model, Strawberry, in the next two weeks through ChatGPT.
According to the source, the model will be given "time to think," a paradigm shift we've talked about repeatedly. In other words, taking up to 20 seconds, Strawberry will "explore the space of possible solutions," meaning the model has time to correct its first impressions, explore new solution paths, and eventually settle on the best answer.
This is similar to humans' System 2 thinking mode, used when solving complex tasks, as opposed to questions we can answer instinctively, like "2+2" (System 1 thinking).
Hence, as a rule of thumb, you'll want to use Strawberry for tasks that require complex thinking, like math or coding.
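To make the "explore, then settle" idea concrete, here is a minimal best-of-N sampling sketch. To be clear, OpenAI has not published how Strawberry searches; this only illustrates spending extra inference-time compute on candidate solutions, and both the model name and the `score_solution` scorer are hypothetical placeholders.

```python
# Minimal best-of-N "think before answering" sketch. NOT OpenAI's method
# (which is unpublished); it just shows spending extra compute at inference
# time by sampling several candidates and keeping the highest-scoring one.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def score_solution(text: str) -> float:
    # Hypothetical scorer: in practice, a reward model, a verifier LLM,
    # or unit tests for code tasks. This placeholder just rewards variety.
    return float(len(set(text.split())))

def best_of_n(prompt: str, n: int = 8) -> str:
    # Higher n means more "thinking time" (and more output tokens to pay for).
    candidates = [
        generator(prompt, max_new_tokens=256, do_sample=True,
                  temperature=0.8)[0]["generated_text"]
        for _ in range(n)
    ]
    return max(candidates, key=score_solution)

print(best_of_n("What is 37 * 43? Think step by step."))
```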
Interestingly, some beta testers don't seem that impressed. The model struggles to recognize when a task is simple and doesn't require search, meaning that in some cases you should go straight to GPT-4o and reserve Strawberry for when it's needed. Some have questioned whether the price increase will be justified.
TheWhiteBox's take:
Of course, these new features, active search and monkey sampling, mean the model will be much more expensive to run, as the average number of generated tokens per user request will be much higher.
Read my Notion article (available to Premium members only) for a detailed analysis of the expected improvements in Strawberry.
Therefore, although I expect the input price per million tokens to remain the same, costs explode during generation as the model performs an active search, making output tokens much more expensive for them... and for you.
In summary, expect to pay more to access Strawberry, be that through the web subscription or through higher $/million output tokens in the API. But will it be worth the presumably much higher price, with some speculating up to $2,000/month?
In the meantime, Bloomberg reports that OpenAI is hoping to raise $6 billion at a $150 billion valuation in its next round, a figure also echoed by The Information and higher than the originally reported $120 billion valuation.
CONTROVERSY
Reflection 70B Creator Accused of Fraud
Matt Shumer, the person behind what was arguably the biggest news of the past week, Reflection 70B, a model that allegedly became the best in the world on most benchmarks despite being just a fine-tuned version of Llama 3.1 70B, has been accused of fraud.
It was all just a lie: Reflection 70B appears to be a farce of a model trained on benchmark data (obviously performing well on it, and not so great outside of it).
Multiple people testing the model couldn't replicate Matt's results, raising concerns. He was forced to come out and excuse himself with a story about corrupted weights that convinced... no one, leading most to believe it was all a fraud.
TheWhiteBox's take:
Honestly, I was one of those fooled by Matt. I trusted that he wasn't lying and that the results were legit, but this is an eye-opener: AI research has become a game of who can game the system fastest.
That said, fraud aside, the technique itself, reflection, which I covered last Sunday, is totally legit: it's used by the likes of Anthropic for Claude and is allegedly a crucial part of the outsized performance of OpenAI's Strawberry.
All in all, this again proves that public benchmarks can be gamed and that you should test models on your own private, unpublished dataset. And speaking of gaming benchmarks, it seems the big labs are doing it, too.
CONTROVERSY
Mistral or Alibaba, Who's Lying?
The desperation to ship improved models must be reaching some of the top AI labs. This week, Mistral presented Pixtral 12B, their first multimodal model, using the chart on the left.
But if you look carefully, the numbers it publishes for Qwen 7B are much worse than the "official" ones Alibaba published on the right. Thus, it's very unclear who is lying about their results.
But again, if we have billion-dollar companies lying, what else should we expect?
REGULATION
The Battle over SB-1047 Endures
As we approach the date on which Gavin Newsom will decide whether to veto or approve the notorious SB-1047, a bill that will impose complex hurdles and even criminal liability on those who create powerful AIs, the division among the community grows by the day.
In the following tweet, Geoffrey Hinton and Yann LeCun, probably two of the top five most important AI researchers in history and both members of the "Godfathers of AI" trio (alongside Yoshua Bengio), take completely opposing views on the bill.
While the former embraces it, having become a big AI x-risk believer, the latter, Meta's Chief AI Scientist, takes the opposite stance, claiming the bill will solve nothing and merely clear the way for closed AI labs to survive without being eaten by open-source.
TheWhiteBox's take:
While I believe some people behind the bill are genuinely scared of AI (based on as-yet-unrealized risks, mind you), I also feel this is just another version of regulatory capture: if you can't beat open-source, destroy it.
It's no surprise that all of Big Tech (except Meta) and all closed frontier AI labs officially support this bill. It makes sense; they are pouring billions into a technology only to be forced into a race to the bottom on prices as open-source matches their performance a few months later.
If AI truly is a very powerful and dangerous technology, the effect SB-1047 will have is precisely the one we should avoid: concentrating it in a small set of private, for-profit companies. In that scenario... what could go wrong, right?
MARKETS
Glean Closes $260 Million Round at $4.6 Billion
Glean AI, a company founded by ex-Google engineers that automates enterprise AI, has closed a Series E funding round at a $4.6 billion valuation only six months after the previous round, where it was valued at "just" $2.2 billion, meaning the company has more than doubled its value in half a year.
TheWhiteBox's take:
Glean seems to be pushing the right buttons.
Despite being a RAG solution that "simply" connects your employees to your corporate data through a chat interface, the quintessential yet hard-to-implement GenAI use case, the company counts customers like Reddit, Samsara, Databricks, and Duolingo, so they must be doing something right.
Importantly, it's a very easy-to-use platform, which probably explains the adoption. At the end of the day, companies want the highest abstraction possible, and a platform that automates the connections to the data and the RAG pipelines, and even has a no-code automation builder, is, quite frankly, irresistible.
On the other hand, they use Gemini, Claude, or ChatGPT to process the data, which, to me, is a strikingly disregarded security risk.
RESEARCH
Speech and Text Simultaneous Generation
A group of Chinese researchers has published Llama-Omni on HuggingFace, a fine-tuning of Llama 3.1 8B that generates text and speech simultaneously with latency as low as 226 milliseconds (human level), producing almost immediate responses to speech inputs.
Additionally, yesterday, Hume AI, a company that uses AI to map our emotions and discover the true representations behind them, released the EVI2 API, a voice-to-voice interaction system.
Using vector embeddings, the same underlying representation technique used in large language models, Hume is applying AI's pattern-matching power to discover new emotion regions, deepening humanity's understanding of our own emotions.
As for the newer version, it also showcases low latency (500-800 ms), though it doesn't feel as immediate as Llama-Omni. Still, while I would not consider it on par with OpenAI's Advanced Voice Mode, the characters sound very natural. Try it yourself for free today.
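For intuition on what "vector embeddings" buys you here, below is a generic sentence-embedding sketch. This is ordinary open-source embedding code, not Hume's proprietary pipeline; the model name and phrases are purely illustrative.

```python
# Generic embedding illustration: map short emotion descriptions to vectors
# and measure how close they sit. Nearby vectors suggest related emotional
# states, which is the intuition behind mapping "regions" of emotion.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a common open checkpoint
phrases = ["quiet contentment", "anxious anticipation", "bittersweet nostalgia"]
vectors = model.encode(phrases, convert_to_tensor=True)

# Pairwise cosine similarities between all three phrases.
print(util.cos_sim(vectors, vectors))
```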
TheWhiteBox's take:
Open-source models prove once again that no proprietary AI solution is safe from being democratized. Even though OpenAI's soon-to-arrive Advanced Voice Mode is superior to this solution, we can't forget that this one is based on a much weaker model, Llama 3.1 8B.
It's then only a matter of time before new voice solutions appear that are as unique and powerful as OpenAI's while remaining free. Remember, competition is deflationary, and nothing competes like open-source. So, we must protect open-source if we want AI to remain cheap.
Transform the way you run your business using AI (Extended Labour Day Sale)
Imagine a future where your business runs like a well-oiled machine, effortlessly growing and thriving while you focus on what truly matters.
This isn't a dream; it's the power of AI, and it's within your reach.
Join this AI Business Growth & Strategy Masterclass and discover how to revolutionize your approach to business.
In just 4 hours, you'll gain the tools, insights, and strategies to not just survive, but dominate your market.
What You'll Experience:
Discover AI techniques that give you a competitive edge
Learn how to pivot your business model for unstoppable growth
Develop AI-driven strategies that turn challenges into opportunities
Free up your time and energy by automating the mundane, focusing on what you love
Tomorrow | 10 AM EST
This is more than just a workshop; it's a turning point.
The first 100 to register get in for FREE. Don't miss the chance to change your business trajectory forever.
TREND OF THE WEEK
Arcee-SuperNova: Merging Models To Yield SuperModels
Arcee, an AI start-up, has announced the launch of Arcee-SuperNova, the best instruction-following model in the world, which you can also deploy in your private cloud. In other words, you don't have to rely on APIs and Microsoft's "good word" that your data won't be stolen.
But it's a model packed with unique surprises.
In just one article, you'll be presented with many of the most advanced post-training techniques, soon to be industry staples, such as model merging and synthetic data generation at scale.
In essence, SuperNova embodies the Generative AI deployment form factor enterprises should aspire to. Here's why.
Take the Easy Route... or the Best Route
Starting your Generative AI journey as a user or enterprise with OpenAI or Anthropic is a no-brainer. They offer the best all-around models, have the best "AI branding" (facilitating discussions with your CEO or direct boss to unlock budgets), and take literally minutes to get started.
However, they have two issues that, to me, are deal-breakers in most enterprise cases.
Data security: For ChatGPT or Claude to work with your data, that data is inevitably exposed to the Internet, as the models you're using are hosted in cloud data centers in Iowa or Phoenix (they are literally there). While some advanced "remedies" are offered, this is a clear cybersecurity risk for your data.
This, even more than hallucination risks, is why most GenAI deployments never reach the production phase.
Very limited fine-tuning: If you've read my issue on augmented LLMs, you'll know that I believe (and plentiful data supports this view) that fine-tuning remains the key performance unlocker. And while proprietary providers offer fine-tuning, it's expensive because it requires full model retraining, which is like killing flies with cannonballs.
And while the reasons data security is an issue are self-explanatory, the fine-tuning point is more nuanced.
You've probably been told that frontier AI models yield the best performance. But that is a half-truth because it ignores fine-tuning, the secret sauce.
The Secret Sauce
But first, what is fine-tuning?
In simple terms, it's a process in which we retrain the base Large Language Model (LLM), or any other AI model, with additional data to improve its performance on a particular use case.
As LLMs are simply compressions of patterns in the original data distribution, aka models that learn to "regenerate" sequences of text similar to those seen during training, they depend heavily on the data to perform well.
In a pre-LLM world, this meant that for every use case, we had to train an entirely new model. Luckily, with LLMs, we have pre-trained models, like ChatGPT or Llama, a foundation where most of the heavy lifting is done for you and the model performs "reasonably well" on many tasks.
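For the curious, here is what a bare-bones supervised fine-tune looks like with Hugging Face's Trainer. It's a minimal sketch under assumptions: the base model name and the `domain_data.jsonl` file are placeholders for whatever base model and task data you actually use.

```python
# Minimal supervised fine-tuning sketch with Hugging Face Transformers.
# Model name and data file are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One JSON object per line with a "text" field containing a training example.
data = load_dataset("json", data_files="domain_data.jsonl")["train"]
data = data.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=data,
    # mlm=False gives the standard next-token (causal) objective and
    # builds the labels for us while padding each batch.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```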
But as I've said many times, foundation models are good at many things, great at none.
Sadly, most people have taken the word "foundation" and assumed those models are "good enough" for everything. This is very convenient, and unsurprisingly it has led people to adopt the (frankly, quite comfortable) heuristic that the larger and better the foundation model, the better the option at all times.
And that's a bare-faced lie.
While size matters and correlates with greater "base intelligence," task-specific data remains king. In other words, unless you put your data to use, you are unequivocally leaving most of the performance on the table.
A chemistry textbook teaches chemistry better than a generalist science book. While the generalist book describes most sciences "well enough," it misses the key nuances and depth of a full chemistry text.
Therefore, to achieve depth (outsized performance on a specific task), you need to fine-tune the foundation model.
And if you want proof, I'll give it to you.
The Power of Fine-tuning
A few months ago, Predibase showed the world how, with fine-tuning, you can turn a small model into one that crushes a much larger one.
They transformed Mistral 7B, a model 257 times smaller than GPT-4, into one that surpassed it across 25 different tasks (via 25 different fine-tunings).
And the best thing of all?
The average cost of this apparently gargantuan effort was just eight dollars, thanks to the smart use of LoRA adapters, a method in which you train only a small set of added parameters that steer the model toward a given task.
Consequently, you need only a single NVIDIA A100 GPU to deploy all 25 models, the same adapter-based strategy Apple is following with Apple Intelligence to deploy AI on your upcoming iPhone 15/16.
In other words, for the very cheap amount of $1,971 ($200 in fine-tuning costs, plus $1,771 for serving on a single NVIDIA A100 for one whole month), you can run 25 processes at higher performance than GPT-4.
Using Mistral 7B's specs from the research paper, just one A100 could provide this performance in batches of 50 (50 people simultaneously), with one Mistral model and 25 adapters (all of very low memory footprint).
Or, to be clear: at roughly three dollars an hour, you can serve your company a model with state-of-the-art performance on 25 tasks, in batches of 50 at 1,000 input tokens per person (this input size is high, so you could probably aim for up to 100 simultaneous requests), a top-notch, secure GenAI implementation with a tokens-per-dollar throughput not even GPT-4o mini can compete with.
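Here is what attaching a LoRA adapter looks like in practice with the open-source `peft` library. This is a minimal sketch, not Predibase's exact recipe; the rank, scaling, and target modules are common illustrative defaults.

```python
# Hedged LoRA sketch with the `peft` library: freeze the base model and
# train only small low-rank matrices injected into the attention layers.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# After training, only the adapter (a few MB) is saved. One GPU can hold the
# base model once and hot-swap many such adapters, one per task.
model.save_pretrained("adapter-out")
```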
For more on LoRA adapters, read this other piece where I dive into detail.
To understand how I calculate the values in the paragraph above, a fundamental exercise for evaluating how many GPUs you'll need for your use case in "on-demand" implementations (highest security), read here (Premium members only).
Long story short, fine-tuning with LoRA adapters should be top of mind for anyone wanting to deploy AI at scale. They offer unmatched performance at low cost. (And you can perform fine-tuning at scale inside your secure environment.)
Thus, if you really want to take your GenAI implementations seriously, proprietary solutions like ChatGPT or Claude don't cut it, period.
Now, Arcee has taken a significant step in this direction by offering models that can be confidently deployed in your Virtual Private Cloud (inside your IT organization) and by using model merging to create a super-powerful yet affordable model.
Model Merging and Smart Fine-tuning
As you may have realized by now, I'm a firm believer that over time, as companies mature in their GenAI journey, they should detach themselves from proprietary, rigid, and insecure OpenAI/Anthropic/DeepMind solutions and embrace open-source (or at least open-weight) models.
And I'm not stating this based on vibes; we saw it in Predibase's hard data a few moments ago, and you're about to see why again.
In simple terms, Arcee has taken Llama 3.1 405B (ChatGPT-level, yet pretty large and uncomfortable to run) and Llama 3.1 70B (a worse model, but nimble) and combined them in three different ways to create a model that:
Is competitive with much larger models on popular benchmarks
Offers the best performance in the world on IFEval, an instruction-following benchmark (image at the top of the article). This is a fundamental capability for LLMs in general because, well, following instructions is what they are meant to do.
And how do they achieve this? Simple: by combining several techniques that are as uncommon as they are effective.
Model Merging
The most striking thing here is that SuperNova is a model merge. As the name implies, and as the GIF below showcases, it combines components from different LLMs to build a merged, more powerful one.
Model merging is prevalent among indie developers with tight budgets, as it lets you find the best combination of layers across different models to maximize performance without training a new model. Now, with examples like Sakana and Arcee, it's becoming popular among well-funded labs, too.
Crucially, finding the best combination can be automated, as shown by Sakana's Evolutionary Model Merging technique.
Source: Sakana
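Before walking through Arcee's recipe, here is the simplest possible merge for intuition: a uniform weight-space average of two same-architecture checkpoints (a "model soup"). Real merging methods, like TIES, SLERP, or Sakana's evolutionary search, combine layers non-uniformly, so treat this purely as a baseline sketch with placeholder paths, not Arcee's actual method.

```python
# Minimal merge baseline: average the weights of two checkpoints with
# identical architectures, tensor by tensor.
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("path/to/checkpoint-a")
model_b = AutoModelForCausalLM.from_pretrained("path/to/checkpoint-b")
state_b = model_b.state_dict()

merged = {
    name: 0.5 * param + 0.5 * state_b[name]  # equal-weight average per tensor
    for name, param in model_a.state_dict().items()
}

model_a.load_state_dict(merged)
model_a.save_pretrained("path/to/merged-model")
```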
All things considered, Arcee-SuperNova's training process was as follows:
First, Llama-3.1-405B-Instruct was distilled into Llama-3.1-70B-Instruct.
In parallel, another model was trained using synthetic instruction data.
A third version was trained with Direct Preference Optimization (DPO) for human-preference alignment.
Finally, the final Arcee-SuperNova model merges all three versions for strong alignment and state-of-the-art instruction-following performance.
But what are we looking for in each step?
In the first step, we perform distillation to pack "more intelligence into a smaller bucket."
In layman's terms, we generate a dataset of responses to queries using the more intelligent model, Llama 3.1 405B, responses naturally smarter than what a weaker model would give, and we fine-tune the smaller model to imitate the larger one.
As you may assume, the smaller model does not become "as intelligent as" the teacher, but it learns form, style, and knowledge from it, leading it to generate more intelligent responses. A minimal sketch of this data-generation step follows below.
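This sketch assumes placeholder file names and that you can query the 405B teacher (in practice, via a hosted endpoint rather than loading it locally); it illustrates the idea, not Arcee's pipeline.

```python
# Hedged sketch of the distillation data step: sample answers from the large
# "teacher" and store (prompt, answer) pairs as supervised targets for the
# smaller "student" to imitate.
import json
from transformers import pipeline

teacher = pipeline("text-generation",
                   model="meta-llama/Llama-3.1-405B-Instruct")

with open("prompts.jsonl") as src, open("distill_data.jsonl", "w") as out:
    for line in src:
        prompt = json.loads(line)["prompt"]
        answer = teacher(prompt, max_new_tokens=512,
                         return_full_text=False)[0]["generated_text"]
        # Each pair becomes one fine-tuning example for the 70B student.
        out.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```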
In the second and third steps, we train the model to be great at two things: instruction following, and adhering better to human preferences using DPO.
Why do we need alignment training?
"Yes, here is a list of things you need to build a bomb..." and "I'm sorry, I can't help you with that" are both semantically viable continuations of the sequence "Help me build a bomb."
Thus, alignment training focuses on helping the model choose the better of two viable options according to human preferences, which obviously include "don't help people commit crimes."
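For reference, here is roughly what the DPO step looks like with the open-source `trl` library. It's a hedged sketch: the paths and preference file are placeholders, and the exact trainer signature varies slightly across `trl` versions (older ones take `tokenizer=` instead of `processing_class=`).

```python
# Hedged DPO sketch with `trl`: the trainer pushes the model's probability
# toward "chosen" responses and away from "rejected" ones.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "path/to/sft-checkpoint"  # start from the supervised fine-tune
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects rows with "prompt", "chosen", and "rejected" fields.
prefs = load_dataset("json", data_files="preferences.jsonl")["train"]

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta scales the KL penalty
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()
```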
So, why is SuperNova worth being the trend of the week? Simple: three reasons.
It proves that non-frontier open-source models can offer frontier-level performance with fine-tuning.
It proves that model merging is not only viable but a recommended technique for obtaining the best performance per unit of spend.
It encapsulates all the features a sound GenAI implementation strategy should have: data and model security, by running in your private cloud, and great adaptability to each use case, through fine-tuning.
TheWhiteBox's take
Technology:
For enterprises to finally embrace GenAI and go beyond experimental budgets and failed production migrations, we need them to understand that fine-tuning on scalable open-source models is the way.
Undeniably, Arcee-SuperNova is a step in that direction.
Products:
While Anthropic seems much more committed to enterprise AI than OpenAI (maybe because the latter's grip on the B2C market is tight), I don't see proprietary models becoming the norm in enterprise, except for fully embedded copilots inside SaaS tools (think Cursor, Canva, Replit Agent, and so on).
To determine which model is best in each case, check my two articles on killer applications coming to AI, parts one and two.
Markets:
Although I'll go into much more detail on Sunday, when we'll make sense of the crazy investment numbers around Generative AI, I am of the opinion that most AI models will be inferenced in private clouds or even on edge devices (smartphones or laptops).
How much the markets are pricing in this possibility is unknown to me, but I don't think the world will run on OpenAI APIs.
THEWHITEBOX
Premium
If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You'll even be able to ask the questions you need answers to.
Until next time!
For business inquiries, reach out to me at [email protected]