• TheTechOasis
  • Posts
  • ChatGPT-4.5 leaked?, Arctic, the King of Enterprise Intelligence

ChatGPT-4.5 leaked?, Arctic, the King of Enterprise Intelligence

🏝 TheTechOasis 🏝

part of the:

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

💎 Big Announcement 💎

In know you’re in full AI Overload and tired of low-impact daily newsletters with loads of hype and little substrate to learn from.

But worry not with TheWhiteBox, a community for high-quality, highly-curated AI content without unnecessary hype or ads, across research, models, markets, and AI products in digestible language.

Why would you want that?

  • Cut the Clutter: Value-oriented, focusing on high-impact news and clear intuitions to extract from them (why should you care).

  • Connect with Peers: Engage in valuable discussions with like-minded AI enthusiasts from top global companies.

  • Exclusive Insights: As the community grows, gain access to special content like expert interviews and guest posts with unique perspectives.

With TheWhiteBox, we guarantee you won’t need anything else.

No credit card information required to join the waitlist.

🚨 Week’s Update 🚨

This week we are starting strong. Former NASA, DARPA, and Deepmind alumni got together to create Field AI, a company that creates robots that adapt to ever-changing environments.

They aim to create Field Foundation Models, the equivalent to Language FMs like ChatGPT but applied to the real world. They claim the robots require little-to-no supervision, are fully autonomous, and don’t require guidance (no need for GPS, maps, and so on) to work in unknown environments.

Take this with a pinch of salt, sounds too good to be true. Even if they are the best at it, navigating out-of-distribution (unknown and very different) environments is the hardest thing in AI robotics.

If true, this is the holy grail. Let’s see if they actually deliver.

Moving on, OpenAI, while playing games with all of us (more on that below) continues to ship interesting features to their ChatGPT Plus subscription. This time, it allows the model to learn about you with a new memory feature.

Hold your horses just one second. Interesting feature, but not an extraordinary one. They are simply adding this to the ‘system instructions’ part of the prompt they send every time you ask something.

Don’t be fooled, this is prompt engineering, not a model enhancement.

On the robotics side, Freethink has released an interesting article that pretty much sums up all the humanoids currently being developed at the frontier of AI.

Furthermore, as you know, AI will play a huge role in warfare. Among the companies that will make sure this happens is Anduril, which just released a next-generation vehicle with insane capabilities.

But AI also has other open questions like regulation. Recently, a new bill in California that could be passed, SB-1047.

Among its backers, prominent figures like Yoshua Bengio or Geoffrey Hinton, godfathers of modern AI. Others claim it kills open-source. The controversy is so strong that the senator behind it wrote a long thread to dismiss some accusations.

Finally, one very exciting news, Scale AI has created a new eval dataset that guarantees no known model has ever seen, avoiding unrealistic results through data contamination (when the model has actually seen the data, basically cheating).

Phi-3-mini, the model we covered last week, achieved a 76% result with ChatGPT-4 Turbo getting 84%, despite being 3.8 billion parameters, hundreds of times smaller.

💎 Sponsored content 💎

MaxAI.me - Do more Faster with 1-Click AI

MaxAI.me lets you chat with GPT-4, Claude 3, Gemini 1.5. You can also perfect your writing anywhere, save 90% of your reading & watching time with AI summary, and reply 10x faster on email & social media.

🧐 You Should Pay Attention to 🧐

  • GPT2-chatbot, ChatGPT-4.5 leaked?

  • Snowflake Arctic, The New King of Enterprise LLMs?

📈 Has ChatGPT-4.5 Been Leaked? 📈

A few days ago, a model many of us played with that has since been deleted, has the entire AI industry speculating. Named “gpt2-chatbot” it was accessible for a few days in the ‘Direct Chat’ function in lmsys.org.

But why so much fuss?

Well, simply because it might have been the unofficial teaser of ChatGPT-4.5 or even GPT-5. Or, maybe, the ‘2’ stands for a new GPT generation that is a ‘different thing’.

But why do I say this? Well, because this model is unlike anything we have ever seen. It’s absolutely on a different level.

Even Sam Altman, CEO of OpenAI, couldn’t resist the temptation to acknowledge its existence:

So, how good is this model, and what on Earth is it?

A Teaser of What’s to Come

With every passing day, it’s clear that OpenAI’s next model will be a leap in reasoning and complex problem-solving. And here are just a few examples of the prowess of this mysterious model that could signal that the boat has finally landed in that port:

All examples below are considered hard or outright impossible for the current state-of-the-art models.

For starters, It solved a math-olympiad problem in zero-shot mode (without being provided with auxiliary examples to support the resolution):

It’s absolutely superb at parsing JSONs, a fundamental skill for LLM integration with APIs and other web-based tools.

Also, it completely obliterates GPT-4 at complex drawing tasks like drawing SVG files based on code or unicorns using ASCII code (below), humiliating Claude 3 Opus, the current state-of-the-art, in the process:

gpt2-chatbot (left), & Claude 3 Opus (right)

Additionally, although this very well could have been a hallucination, the model claimed to me that it was trained by OpenAI and based on a GPT-4 variant.

Of course, after such a demonstration of power, many suggest that “gpt2-chatbot” might even be the famous Q* model. But instead of simply giving in to the different fanciful options people have claimed this is, let’s take a more sensible approach and see what OpenAI itself has been hinting at through their research for months (and years).

The Power of Long Inference

For several months, experts in the space like Demis Hassabis or Andrej Karpathy have discussed how LLMs alone simply aren’t it, and that we need ‘something else’ to really take them to the next step.

In both cases, they refer to achieving the equivalent of ‘AlphaGo but in LLMs’, which is indirectly referring to:

  • Self-improvement and

  • test-time computation LLMs

But what do they mean by that?

A Giant Step for AI

AlphaGo is history of AI. It was the first model that unequivocally surpassed human might in the game of Go, a Korean board game.

It used Monte Carlo Tree Search, a search algorithm, to explore the realm of possible moves for any given step in the game, being able to go beyond the current action and predict what the opposing player would do.

Some of you might remember Deep Blue too, the chess machine that barely beat Gary Kasparov in the second game in their series back in 1997 after losing the first game.

However, while Deep Blue could be beaten, AlphaGo was invincible. But how?

Self-improving to go Superhuman

The key element that made AlphaGo superior was how it was trained, by playing against lesser versions of itself to create a self-improvement loop.

It consistently played against itself, gradually improving its ELO to 3.739, almost at the level of today’s best Go player.

In 2017, AlphaZero, an improved version, achieved a 5.018 ELO, completely superhuman and unbeatable.

In other words, with AlphaGo humans had achieved, for the first time, a way to train a model by self-improvement, allowing it to reach superhuman capacities as it no longer relied on imitating humans to learn.

In case you’re wondering, this is not the case for LLMs.

Current LLMs are completely chained to human-level performance, to the point that the alignment phase, the part of the training process where LLMs are modeled to improve their safety levels and avoid offensive responses, is strictly executed using ‘human preferences’.

On a sidenote, Meta recently proposed Self-Rewarding Models that could self-improve with their own responses. However, it’s unclear whether this feedback loop really can make LLMs superhuman.

But even though it still feels hard to believe that “gpt2-chatbot” has been trained through self-improvement, we have plenty of reasons to believe it’s the first successful implementation of what OpenAI has been working on for years: test-time computation.

The Arrival of test-time computation models

Over the years, several research papers by OpenAI have hinted at this idea of skewing models into ‘heavy inference’. For example, back in 2021, they presented the notion of using ‘verifiers’ at inference to improve the model’s responses when working with Math.

The idea was to train an auxiliary model that would evaluate in real-time several responses the model gave, choosing the best one (which was then served to the user).

This, combined with some sort of tree search algorithm like the one used by AlphaGo, with examples like Google Deepmind’s Tree-of-Thought research for LLMs, and you could eventually create an LLM that, before answering, explores the ‘realm of possible responses’, carefully filtering and selecting the best path toward the solution.

A Tree-of-thought example

This idea has become pretty popular these days, with cross-effort research by Microsoft and Google applying it to train next-generation verifiers, and with Google even managing to create a model, Alphacode, that executed this kind of architecture to great success, reaching the 85% percentile among competitive programmers, the best humans at it.

And why does this new generation of LLMs have so much potential?

Well, because they approach problem-solving very similarly to humans through deliberate and extensive thought to solve a given task.

Bottom line, think of ‘search+LLM’ models as AI systems that allocate a much higher degree of compute (akin to human thought) to the actual model execution so that, instead of having to guess the correct solution immediately, they are, simply put, ‘given more time to think’.

What We Think

Considering gpt2-chatbot’s insane performance and keeping in mind OpenAI’s recent research and leaks, we might have a pretty nice idea by now of what on Earth this thing is.

What we know for sure is that we are soon going to be faced with a completely different beast, one that will take AI’s impact to the next level.

I guess we will have to wait to get those answers. But not for long.

❄️ SnowFlake Arctic, The 👑 of Enterprise LLMs ❄️

Snowflake has released a new model, named Arctic, that has the enterprise community paying attention for various reasons.

It’s a 480B model that yields impressive performance in what they call 'Enterprise Intelligence' (it's okay if you cringe, I did too).

Jokes aside, the truth is that the model is freakishly good, and was trained with a budget under $2 million, signifying how good AI engineers are getting at building LLMs.

Also, it’s highly reminiscent of one of the most exciting architectures I’ve seen recently, hyper-specialized MoEs.

Cringe, but Awesome

‘Enterprise Intelligence’ is nothing but a non-weighted average of model results over a combination of 'enterprise-relevant' benchmarks for coding, SQL generation, and instruction following.

In layman’s terms, this metric tells us that the model is good at these three things, probably the best if we focus on sup-70-billion-parameter models.

Of course, the benchmarks were cherrypicked (as they always are) but they nevertheless paint a pretty damn good picture for the model.


However, it’s still huge with 480 billion parameters.

That beast alone occupies at the very least 6 state-of-the-art NVIDIA H100s (80GB of HBM each), and that's without considering the KV Cache, the LLMs cache to avoid unnecessary recomputations, which is the biggest limiting memory factor when considering large sequences.

But here's the most exciting thing about this model: It’s a deep-mixture-of-experts with 128 experts and one global expert, which paints a completely different reality at inference.

If you’re in for a more in-depth explanation, this architecture is very similar to DeepSeek-MoE's approach you can review here.

In standard Mixture-of-Experts, you divide the network into experts. In practice, for any given prediction the model needs to perform (basically predicting the next word) a router chooses a fixed set of experts to be activated.

Thus, instead of running the entire network, only a small subset of it participates, racking up savings in compute (money) and latency (time).

Thus, every expert becomes an expert in a smaller set of topics. However, Arctic, just like DeepSeekMoE, takes an in-between approach. Specifically:

  • A 10 billion-parameter expert runs everytime

  • The model includes 128, 3.66B-each additional experts, with the model choosing two for every prediction.

This leaves us with an architecture eerily similar to the one to the right below (with the one on the left being the standard MoE), where expert number 1 (the global expert) runs every time, and 2 out of the 128 additional experts will be chosen to participate in the prediction.

Simply put, while the model is indeed 480 billion parameters in size, only 17B, or 3.45% of the model's parameters, are active for a given prediction.

Hence, you have the prowess of a huge model, and the cost of a model only 4% of its real size.

Please note that the reduction in costs and latency isn't exact, as MoEs only partition the feedforward layers (which account for the vast majority of FLOPs nonetheless) but the attention layers remain untouched.

What We Think

Despite the marketing stunt of the name they gave to their self-proclaimed king of ‘enterprise intelligence’, the truth is that Arctic seems like a very interesting option.

Good model, and affordable cost.

🧐 Final Thoughts 🧐

Considering this week’s news, it’s clearer as ever the two great trends in the AI space.

  • We ever-approaching transition into more powerful new types of AI models,

  • and the increasing improvements we are seeing at the deployment level with much more efficient models.

Both are necessary; one pushes the veil of ignorance forward, the other enables value creation with AI.

Until next time!

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]