OpenAI's New Model, the Reflection Drama, & Model Merges

THEWHITEBOX
TLDR;

  • 📰 News on:

    • 🍓 OpenAI’s Upcoming Model,

    • 🫣 the Reflection Drama,

    • 🤔 Who’s Lying, Mistral or Alibaba?,

    • 💰 Glean’s massive new round,

    • 🥵 SB-1047 bill,

    • 🫢 Hume’s new emotion model.

  • [TREND OF THE WEEK] Arcee’s SuperNova Model Merge

NEWSREEL
OpenAI’s New Model, Out in 2 Weeks

It seems that the wait is finally over. According to The Information (paywalled, but your boy is here to give you all the details), OpenAI intends to release its new model, Strawberry, in the next two weeks through ChatGPT.

According to the source, the model will be given ‘time to think,’ a paradigm shift we’ve talked about repeatedly. In other words, taking up to 20 seconds, Strawberry will ‘explore the space of possible solutions’, which means the model has time to correct its first impressions, explore new solution paths, and eventually settle on the best one.

This is similar to humans' System 2 thinking mode when solving a complex task, instead of questions that can be responded to innately, like “2+2” (called System 1 thinking).

Hence, as a rule of thumb, you’ll want to use Strawberry in cases where the model requires complex thinking, like math or coding.
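To make ‘exploring the space of possible solutions’ a bit more concrete, here is a minimal, purely illustrative sketch of best-of-N sampling: generate several candidate answers and keep the one a verifier scores highest. This is my toy example under my own assumptions; OpenAI has not published Strawberry’s actual method, and `generate_candidate` and `score` are placeholders for a model call and a reward model/checker.

```python
import random

def generate_candidate(prompt: str, temperature: float = 0.9) -> str:
    """Placeholder for one sampled completion from an LLM (hypothetical)."""
    # In practice this would call a model API with sampling enabled.
    return f"candidate answer to {prompt!r} (t={temperature}, seed={random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Placeholder verifier that rates an answer (hypothetical)."""
    # In practice: a reward model, unit tests (for code), or a checker (for math).
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

print(best_of_n("What is the derivative of x^3 + 2x?"))
```

The trade-off is exactly the one discussed above: more candidates explored per request means many more tokens generated, and therefore higher cost.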

Interestingly, some beta testers don’t seem that impressed. The model struggles to recognize when a task is simple and does not require search, meaning that in some cases you are better off using GPT-4o directly and reserving Strawberry for when it’s needed. Some have questioned whether the price increase will be justified.

TheWhiteBox’s take:

Of course, these new features (active search and monkey sampling) mean that the model will be much more expensive to run, as the average number of generated tokens per user request will be much higher.

Read my Notion article (available to Premium members only) for a detailed analysis of the expected improvements on Strawberry.

Therefore, although I expect the input price per million tokens to remain the same, costs will explode during generation as the model performs its active search, making the output tokens much more expensive for them… and for you.

In summary, expect to pay more to access Strawberry, be that through the web subscription or through higher $/million output tokens in the API. But will it be worth the presumably much higher price, with some speculating up to $2,000/month?

In the meantime, Bloomberg reports that OpenAI is hoping to raise $6 billion at a $150 billion valuation in its next round, a figure also echoed by The Information and higher than the originally reported $120 billion valuation.

CONTROVERSY
Reflection 70B Creator Accused of Fraud

Matt Shumer, the person behind what was arguably the biggest news of the past week, a model called Reflection 70B that supposedly became the best model in the world on most benchmarks despite being just a fine-tuned version of Llama 3.1 70B, has been accused of fraud.

It was all just a lie: Reflection 70B appears to be a farce, a model trained on benchmark data (so it obviously performs well on those benchmarks but not so well outside of them).

Multiple people testing the model couldn’t replicate Matt's results, raising concerns. He was forced to come out and try to excuse himself with an explanation of corrupted weights that convinced… no one, leading everyone to believe it’s all a fraud.

TheWhiteBox’s take:

Honestly, I was one of those fooled by Matt. I trusted that he wasn’t lying and that the results were legit, but this is an eye-opener: AI research has become a game of who can game the system fastest.

That said, and without excusing the fraud, the technique they claimed to use, reflection, which I covered last Sunday, is totally legit: it is used by the likes of Anthropic for Claude and is allegedly a crucial part of OpenAI’s Strawberry model’s outsized performance.

All in all, this again proves that public benchmarks can be gamed, and you should test models on your own private, unpublished dataset. And speaking of gaming benchmarks, it seems the big labs are doing it, too.

CONTROVERSY
Mistral or Alibaba, Who’s Lying?

The desperation to ship improved models must be reaching some of the top AI labs. This week, Mistral presented Pixtral 12B, their first multimodal model, using the benchmark chart on the left.

But if you look carefully, the numbers Mistral published for Qwen 7B are much worse than the ‘official’ ones Alibaba published on the right. Thus, it’s very unclear who is lying about the results.

But again, if we have billion-dollar companies lying, what else should we expect?

REGULATION
The Battle over SB-1047 Endures

As we approach the date on which Gavin Newsom will decide whether to veto or approve the notorious SB-1047, a bill that will impose complex hurdles and even criminal liability on those who create powerful AIs, the division among the community grows by the day.

In the following tweet, Geoffrey Hinton and Yann LeCun, probably two of the five most important AI researchers in history and both members of the “Godfathers of AI” trio (alongside Yoshua Bengio), completely disagree over this bill.

While the former embraces it, having become a big AI x-risk believer, the latter, Meta’s Chief AI Scientist, takes the complete opposite view, claiming that the bill will solve nothing and will merely clear the way for closed AI labs to survive without being eaten by open-source.

TheWhiteBox’s take:

While I believe that some people behind the bill are genuinely scared of AI (based on risks that have yet to materialize, mind you), I also feel that this is just another version of regulatory capture: if you can’t beat open-source, destroy it.

It’s no surprise that all Big Tech (except Meta) and all closed frontier AI labs officially support this bill. It makes sense; they are pouring billions into a technology only to be forced into a race to the bottom in prices as open-source matches its performance a few months later.

If AI is truly a very powerful and dangerous technology, the effect SB-1047 will have is precisely the one we should avoid: concentrating it in a handful of private, for-profit companies. In that scenario… what could go wrong… right?

MARKETS
Glean Closes $260 Million Round at a $4.6 Billion Valuation

Glean, a company founded by former Google engineers that builds enterprise AI search and assistants, has closed a $260 million Series E at a $4.6 billion valuation only six months after its previous round, where it was valued at ‘just’ $2.2 billion, meaning it has more than doubled its value in half a year.

TheWhitebox’s take:

Glean seems to be touching the right buttons.

Despite being ‘just’ a RAG solution that connects your employees to your company’s data through a chat interface (the quintessential yet hard-to-implement GenAI use case), the company counts customers like Reddit, Samsara, Databricks, and Duolingo, so they must be doing something right.

Importantly, it’s a very easy-to-use platform, which probably explains the adoption. At the end of the day, companies want the highest abstraction possible, and having a platform that automates the connections to the data and the RAG pipelines and even has a no-code automation builder is, quite frankly, irresistible.

On the other hand, they use Gemini, Claude, or ChatGPT to process the data, which, to me, is a strikingly overlooked security risk.

RESEARCH
Speech and Text Simultaneous Generation

A group of Chinese researchers has published Llama-Omni on HuggingFace, a fine-tuning of Llama 3.1 8B that generates text and speech simultaneously, with latency as low as 226 milliseconds (human level), producing almost immediate responses to speech input.

Additionally, yesterday, Hume AI, a company that uses AI to map our emotions and discover the true representations behind them, released the EVI2 API, a voice-to-voice interaction system.

Using vector embeddings, the same underlying technique behind large language models, Hume leverages AI’s pattern-matching power to discover new emotion regions, deepening our understanding of human emotion.
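As a rough illustration of how embeddings enable that kind of pattern matching (this is my own sketch, not Hume’s pipeline), you can embed expressions and compare them with cosine similarity; nearby vectors correspond to related emotional states. The encoder name below is just an example of a general-purpose open-source model.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Any general-purpose text encoder works for this toy example.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = ["I'm thrilled about the results!",
           "I can't stop smiling today.",
           "I feel completely drained and hopeless."]

emb = model.encode(phrases, normalize_embeddings=True)  # unit-length vectors

# Cosine similarity is just a dot product for normalized vectors.
sim = emb @ emb.T
print(np.round(sim, 2))  # the two 'joy' phrases should score closer to each other
```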

As for the newer version, it also shows low latency (500–800 ms), but it doesn’t feel as immediate as Llama-Omni. Still, while I would not consider it on par with OpenAI’s Advanced Voice Mode, the voices sound very natural. Try it yourself for free today.

TheWhiteBox’s take:

Open-source models prove once again that no proprietary AI solution is safe from being disrupted and democratized. Even though OpenAI’s soon-to-arrive Advanced Voice Mode is superior to this solution, we can’t forget that this one is based on a much weaker model, Llama 3.1 8B.

It’s then a matter of time before new voice solutions appear that are as unique and powerful as OpenAI’s while remaining free. Remember, competition is deflationary, and nothing competes like open-source. So, we must protect open-source if we want AI to remain cheap.


TREND OF THE WEEK
Arcee-SuperNova: Merging Models To Yield SuperModels

Arcee, an AI start-up, has announced the launch of Arcee-SuperNova, the best instruction-following model in the world, which you can also deploy in your private cloud. In other words, you don’t have to rely on APIs and Microsoft’s ‘good word’ that your data won’t be stolen.

But it’s a model packed with unique surprises.

Thus, in just one article, you will be introduced to many of the most advanced post-training techniques, soon to become a staple of the industry, such as model merging and synthetic data generation at scale.

In essence, Supernova embodies the Generative AI deployment form factor enterprises should aspire to. Here’s why.

Take The Easy Route… or The Best Route

Starting your Generative AI journey as a user or enterprise with OpenAI or Anthropic is a no-brainer. They offer the best all-around models, have the best ‘AI branding’ (facilitating discussions with your CEO or most direct boss to enable budgets), and take literally minutes to get started.

However, they have two issues that, to me, are deal-breakers in most enterprise cases.

  1. Data security: For ChatGPT or Claude to work with your data, your data is inevitably exposed to the Internet, as the models you’re using are hosted in cloud data centers in Iowa or Phoenix (they are literally there). While some advanced ‘remedies’ are offered, this is a clear cybersecurity risk for your data.

This is why most GenAI deployments never make it into production, ranking even above hallucination risks as a blocker.

  2. Very limited fine-tuning: If you’ve read my issue on augmented LLMs, you will know that I believe (and plentiful data supports this view) that fine-tuning remains the key performance unlocker. And while these private companies offer fine-tuning, it’s expensive because it requires full model retraining, which is like killing flies with cannonballs.

And while the reasons why data security is an issue are self-explanatory, fine-tuning is more nuanced.

Nonetheless, you’ve probably been told that frontier AI models yield the best performance. But that is a half-truth because it does not consider fine-tuning—the secret sauce.

The Secret Sauce

But first, what is fine-tuning?

In simple terms, it’s a process in which we retrain the base Large Language Model (LLM)—or any other AI model—with additional data to improve its performance on a particular use case.

As LLMs are simply compressions of patterns in the original data distribution, aka models that learn to ‘regenerate’ sequences of text similar to the ones seen during training, they depend heavily on that data to perform well.

In a pre-LLM world, this meant that for every use case, we had to train an entirely new model. Luckily, with LLMs, we have pre-trained models, like ChatGPT or Llama, a foundation where most of the heavy lifting is done for you and the model can perform ‘reasonably well’ in many tasks.

But as I’ve said many times, foundation models are good at many things, great at none.

Sadly, most people have taken the word foundation to mean that those models are ‘good enough’ for everything. It’s a very convenient assumption, and it has led to the (frankly, quite comfortable) heuristic that the larger and better the foundation model, the better the option it is at all times.

And that’s a bare-faced lie.

While size is important and correlates to greater ‘base intelligence,’ task-specific data remains king. In other words, unless you put your data to use, you are unequivocally leaving most of the performance off the table.

A chemistry textbook is better at chemistry than a general science book. While the general book covers most sciences ‘well enough,’ it misses the key nuances and depth of a full chemistry book.

Therefore, to achieve depth (outsized performance on a specific task), you need to fine-tune the foundation model.
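To make that concrete, here is a minimal sketch of a supervised fine-tune using the open-source transformers library. The checkpoint name and the two training examples are placeholders I chose for illustration, not a recommendation or anyone’s exact recipe.

```python
# A minimal supervised fine-tuning sketch with Hugging Face Transformers.
# Model name and the tiny dataset below are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-3.1-8B"   # any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your task-specific data: the 'additional data' fine-tuning retrains on.
examples = [{"text": "Q: What is our refund window?\nA: 30 days from delivery."},
            {"text": "Q: Do we ship to Canada?\nA: Yes, via our Toronto warehouse."}]
ds = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the base weights are updated in place on your data
```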

And if you want proof, I’ll give it to you.

The Power of Fine-tuning

A few months ago, Predibase showed the world how, with fine-tuning, you can turn a small model into one that crushes a much larger one.

They transformed a Mistral 7B model, a model 257 times smaller than GPT-4, into one that surpassed the latter across 25 different tasks (via 25 separate fine-tunings).

And the best thing of all?

The average cost of this apparently gargantuan effort was just eight dollars per fine-tune, thanks to the smart use of LoRA adapters, a method in which you train only a tiny set of additional low-rank parameters that adapt the model to a given task while the base weights stay frozen.
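For context, here is a minimal sketch of what a LoRA fine-tune looks like with the open-source peft library. This is my own example, not Predibase’s exact setup; the base model and hyperparameters are common illustrative defaults.

```python
# Minimal LoRA setup with the `peft` library (illustrative, not Predibase's recipe).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # only attention projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Training then proceeds like a normal fine-tune, but only the adapter
# weights (a few megabytes) are updated and saved; the 7B base is untouched.
```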

Consequently, you only need a single NVIDIA A100 GPU to deploy all 25 ‘models’ (one base model plus the 25 adapters), the same adapter strategy Apple is following with Apple Intelligence to deploy AI on your iPhone 15/16.

In other words, for the very cheap total of $1,971 ($200 in fine-tuning costs plus $1,771 to rent a single NVIDIA A100 for a whole month), you can run 25 tasks at higher performance than GPT-4.
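A quick back-of-the-envelope check of those numbers (my own arithmetic, using the figures above and assuming a ~720-hour month):

```python
# Back-of-the-envelope check of the serving math above (approximate figures).
n_tasks = 25
finetune_cost_per_task = 8          # dollars, per Predibase's reported average
a100_month = 1_771                  # dollars for one A100 reserved for a month
hours_per_month = 30 * 24           # ~720 hours

total = n_tasks * finetune_cost_per_task + a100_month
print(total)                                    # -> 1971, the ~$1,971 figure above
print(round(a100_month / hours_per_month, 2))   # -> ~2.46 $/hour ("3ish dollars")
```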

Using Mistral 7B’s specs from the research paper, just one A100 would provide this performance in batches of 50 (50 people simultaneously), with one Mistral model and 25 adapters (each adapter taking very little memory).

Or, to be clear: at roughly three dollars an hour, you can serve your company a model with state-of-the-art performance across 25 tasks, in batches of 50 at 1,000 input tokens per person (this input size is quite high, so you could probably aim for up to 100 simultaneous requests). That’s a top-notch GenAI implementation: secure, and with a tokens-per-dollar throughput not even GPT-4o mini can compete with.
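To make the ‘one GPU, 25 adapters’ idea concrete, here’s a hedged sketch of serving several LoRA adapters on a single copy of the base model with peft. The adapter names and paths are hypothetical placeholders; a production setup would add batching and a proper server.

```python
# Serving many LoRA adapters on ONE copy of the base model (illustrative sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load the first adapter, then attach the rest; each one is only a few MB.
model = PeftModel.from_pretrained(base, "adapters/support-tickets", adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")
model.load_adapter("adapters/contract-summaries", adapter_name="legal")

def answer(prompt: str, task: str) -> str:
    model.set_adapter(task)  # route the request to the task-specific adapter
    inputs = tok(prompt, return_tensors="pt").to(base.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)

print(answer("Summarize clause 4.2 of the attached contract.", task="legal"))
```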

To read more on LoRA adapters, read this other piece where I dive into detail.

To understand how I calculate the values in the above paragraph, a fundamental exercise to evaluate how many GPUs you’ll need for your use case in ‘on-demand’ implementations (highest security), read here (only Premium members).

Long story short, fine-tuning with LoRA adapters should be top of mind for anyone wanting to deploy AI at scale. They offer unmatched performance at low cost, and you can perform the fine-tuning at scale inside your own secure environment.

❝

Thus, if you really want to take your GenAI implementations seriously, proprietary solutions like ChatGPT or Claude are not a valid option, period.

Now, Arcee has taken a significant step in this direction by offering models that can be confidently deployed in your Virtual Private Cloud (inside your own IT organization) and by using model merging to create a super-powerful yet affordable model.

Model Merging and Smart Fine-tuning

As you may have realized by now, I’m a firm believer that over time, as companies mature in their GenAI journey, they should detach themselves from proprietary, rigid, and insecure OpenAI/Anthropic/DeepMind solutions and embrace open-source (or at least open-weight) models.

And I’m not stating this based on vibes; we saw it with Predibase’s hard data a few moments ago, and you’re about to see why again.

In simple terms, Arcee has taken Llama 3.1 405B (ChatGPT level, yet pretty large and uncomfortable to run) and Llama 3.1 70B (worse model, but nimble) and combined them in three different ways to create a model that:

  1. Is competitive with much larger models on popular benchmarks,

  2. Offers the best performance in the world on IFEval, an instruction-following benchmark (image at the top of the article). This is a fundamental capability for LLMs in general because, well, following instructions is exactly what they are meant to do.

And how do they achieve this? Simple: by combining several techniques that are as uncommon as they are effective.

Model Merging

The most striking thing here is that SuperNova is a model merge. As the name implies, and as the gif below showcases, it combines components from different LLMs to build a single, more powerful merged model.

Model merging is prevalent among indie developers with tight budgets, as it lets you find the best combination of layers from different models to maximize performance without having to train a new model. Now, with examples like Sakana and Arcee, it’s becoming more popular among well-funded labs, too.

Crucially, finding the best combination can be automated, as shown by Sakana’s Evolutionary Model Merging technique.

Source: Sakana
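To give a feel for what a merge actually does, here is the simplest possible version: a plain weighted average of two architecturally identical checkpoints. SuperNova and Sakana’s evolutionary method are far more sophisticated (per-layer or evolved weightings); the model names and the mixing coefficient below are placeholders of my own.

```python
# The simplest possible model merge: a weighted average of two checkpoints
# with identical architectures. Real pipelines (Arcee, Sakana) use far more
# sophisticated strategies; this only illustrates the core idea.
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("org/model-a")  # placeholder names,
model_b = AutoModelForCausalLM.from_pretrained("org/model-b")  # must share architecture

alpha = 0.6  # how much of model A to keep; finding good values is the hard part
state_a, state_b = model_a.state_dict(), model_b.state_dict()
merged_state = {name: alpha * state_a[name] + (1 - alpha) * state_b[name]
                for name in state_a}

merged = AutoModelForCausalLM.from_pretrained("org/model-a")  # reuse A's config
merged.load_state_dict(merged_state)
merged.save_pretrained("merged-model")
```

Automated approaches like Sakana’s evolutionary merging essentially search over these mixing coefficients (and which layers to take from which parent) instead of picking them by hand.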

All things considered, Arcee-Supernova’s training process was as follows:

  • First, Llama-3.1-405B-Instruct was distilled into Llama-3.1-70B-Instruct.

  • In parallel, another model was trained using synthetic instruction data.

  • A third version was trained with Direct Preference Optimization (DPO) for human preference alignment.

  • Finally, Arcee-SuperNova merges all of these versions for strong alignment and state-of-the-art instruction-following performance.

But what are we looking for in each step?

In the first step, we perform distillation to pack ‘more intelligence into a smaller bucket.’

In layman’s terms, we use the more intelligent model, Llama 3.1 405B, to generate a dataset of responses to queries (naturally smarter than what a weaker model would produce) and then fine-tune the smaller model to imitate them.

As you may assume, the smaller model does not become ‘as intelligent as’ the teacher, but it learns form, style, and knowledge from it, leading it to generate more intelligent responses.
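Here is a hedged sketch of that distillation loop. This is not Arcee’s actual code; the prompts, model names, and scale are illustrative, and in practice the teacher would be a hosted endpoint rather than a locally loaded 405B model.

```python
# Distillation via imitation: the big teacher writes the answers,
# the small student is later fine-tuned on them (illustrative sketch only).
from transformers import pipeline
from datasets import Dataset

teacher = pipeline("text-generation", model="meta-llama/Llama-3.1-405B-Instruct")

queries = ["Explain LoRA in two sentences.",
           "Write a SQL query that counts orders per customer."]

# 1) Build a synthetic dataset of teacher responses.
records = []
for q in queries:
    answer = teacher(q, max_new_tokens=256)[0]["generated_text"]
    records.append({"text": f"User: {q}\nAssistant: {answer}"})
distill_ds = Dataset.from_list(records)

# 2) Fine-tune the smaller student (e.g., Llama-3.1-70B-Instruct) on distill_ds,
#    using the same Trainer/LoRA setup sketched earlier, so it learns to
#    reproduce the teacher's style and knowledge.
```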

In the second and third steps, we train other versions to be great at two things: instruction following and, via DPO, better adherence to human preferences.

Why do we need alignment training?

“Yes, here is a list of things you need to build a bomb…” and “I’m sorry I can’t help you with that” are both semantically viable continuations to the sequence “Help me build a bomb.”

Thus, alignment training focuses on helping the model choose the ‘best option among two viable options’ according to human preferences, which obviously include ‘don’t help people commit crimes’.
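For the curious, the core of DPO fits in a few lines: push the policy to assign a higher likelihood to the preferred (‘chosen’) answer than to the rejected one, relative to a frozen reference model. Below is a minimal sketch of the loss with toy numbers of my own, not Arcee’s training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers 'chosen' over 'rejected'
    # compared to the reference model.
    logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()

# Toy batch: the policy already slightly prefers the chosen answers.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)  # the loss shrinks as the chosen/rejected margin grows vs. the reference
```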

So, why is SuperNova worth being the trend of the week? Simple, for three reasons:

  1. It proves that non-frontier open-source models can offer frontier-level performance with fine-tuning.

  2. It proves that model merging is not only viable but a recommended technique for getting the best performance per unit of spend.

  3. It encapsulates all the features a sound GenAI implementation strategy should have: data and model security by running it in your private cloud and great adaptability to each use case through fine-tuning.

TheWhiteBox’s take

Technology:

For enterprises to finally embrace GenAI and go beyond experimental budgets and failed production migrations, we need them to understand that fine-tuning on scalable open-source models is the way.

Undeniably, Arcee-Supernova is a step in that direction.

Products:

While Anthropic seems much more committed to enterprise AI than OpenAI (maybe because the latter’s grip on the B2C market is tight), I don’t see proprietary models becoming the norm in the enterprise, except for fully embedded copilots delivered through SaaS tools (think Cursor, Canva, Replit Agent, and so on).

To determine which model is best in each case, check my two articles on killer applications coming to AI, parts one and two.

Markets:

Although I’ll delve into much more detail on Sunday, when we’ll make sense of the crazy investment numbers around Generative AI, I am of the opinion that most AI models will be inferenced in private clouds or even at the edge (smartphones or laptops).

How much the markets are pricing in this possibility is unknown to me, but I don’t think the world will run on OpenAI APIs.

THEWHITEBOX
Premium

If you like this content, by joining Premium you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]