
NVIDIA Powerhouse model and Mixture-of-Agents Beats GPT-4o

🏝 TheTechOasis 🏝


Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

🚨 Week’s Update 🚨

Welcome back! This week, we have news from start-ups founded by legends, SoftBank’s latest crazy idea, and an obvious truth about OpenAI.

To start, Ilya Sutskever, the OpenAI co-founder who recently left, has started a new start-up, Safe Superintelligence Inc., focused solely on building safe superintelligence.

Ilya has always been deeply worried about AI's impact on our lives, so his departure from OpenAI surprised no one. Judging by the new company's mission, his actions speak louder than his words.

Speaking of AI legends, Geoffrey Hinton, one of the 'Godfathers of AI' (although the more precise title would be 'of Deep Learning'), has announced his support for CuspAI, whose aim is to discover new materials for uses such as carbon capture.

The idea of using AI to discover new materials is very exciting, even in what's becoming a heavily crowded market. That said, it strikes me as the kind of field that industry leaders like Google or OpenAI will want to verticalize into, which makes me less optimistic about the future of start-ups in the space.

For instance, Google DeepMind already has GNoME, which I wrote about.

Moving on, one of the wildest start-up ideas ever has been announced: Butterflies, a social media app where humans and AIs (called Butterflies) co-exist and interact. It has raised $4.6 million.

You can join and create as many Butterflies as you want, and these AIs will autonomously upload photos and interact with other users.

I have two things to say about this. I hate this idea, and I really, really hate this idea. Really, are we losing our minds?

Hardware-wise, to celebrate its new crown as the most valuable company in the world, NVIDIA has released benchmark results showing that its GPUs beat rivals such as Intel's Gaudi 2 accelerators.

These days, it seems NVIDIA can do no wrong. At this point, only two things can stop this beast: a Chinese blockade of the Taiwan Strait (halting advanced chip production) and limited energy grids; you will receive an article on the latter tomorrow.

Luckily, nothing seems to indicate the former will take place anytime soon, and we have time to solve the latter problem… in theory.

Furthermore, it's time for the ol' reliable OpenAI controversy. People have recently become extremely skeptical about OpenAI's actions.

For starters, they are seriously considering becoming a for-profit, signaling that their transition to closed source is definitive. That's totally fine; at least they are finally open about it.

But the moves that worry people are more 'under the hood.' The start-up already has more than 35 lobbyists, a fact Bill Gurley, a famous VC investor, has used to claim that they are unequivocally seeking regulatory capture.

Trying to win in Washington what they potentially can't win in the open market is not what you want to see from a company that many claim 'is so far ahead.'

Considering Mira Murati’s recent statements, maybe… they aren’t?

Finally, in what can be considered SoftBank’s latest weird vision, they are internally building an AI tool that will soften the tone of angry customers when interacting with customer support.

Naturally, the announcement has been met with mixed reviews. Some praised it, but others compared it to putting nets around a sweatshop's windows to prevent workers from killing themselves by jumping out.

I understand where this is coming from, but if you get many angry customers, maybe you’re the problem?

🧐 You Should Pay Attention to 🧐

  • Nemotron, NVIDIA’s GPT-4 level New Model

  • Mixture-of-Agents Beats GPT-4o

👾 NVIDIA’s Nemotron GPT-4 Level Model 👾

NVIDIA, not content with being the greatest story ever told in the markets (it is not only the most valuable company in the world but also accounts for almost half of the entire S&P 500's gains this year, utter madness), has released a new model.

As we shared in our previous deep dive on the company, NVIDIA is venturing into new areas of AI beyond GPUs. Now, it has announced a new model, Nemotron-4 340B, that beats GPT-4o (and any other model that dares to compare) in some specific areas.

Moreover, this release includes fascinating findings: the model excels at synthetic data generation (allowing users to generate specialized data to train their own models) and, crucially, offers evidence that weaker AIs can train stronger AIs, a counterintuitive yet critical requirement for safety training.

An Optimized Beast

In short, Nemotron-4 340B is a decent-sized Large Language Model (LLM) that excels at crucial tasks in today’s world such as rewards and, above all, synthetic data generation.

A Predictable Architecture

When it comes to the architecture itself, there aren't many surprises beyond confirmation that Grouped-Query Attention (GQA) has become the norm.

Without going into much detail because it’s not the point of this article, LLMs cache (store) some of the computations they perform during inference over successive word predictions, something we know as the KV Cache.

But what do we actually store?

LLMs use a token mixer, the attention mechanism, to process data in the input sequence by updating the meaning of each word with regard to the other words in the sequence.

Each word has a query, a key, and a value vector; the query of one word is used to ‘talk’ to other words’ keys, and the value vector is used to update the meaning of each word concerning previous words in the sequence.

This exercise is repeated multiple times with circuits we know as ‘attention heads’, greatly increasing computation—and memory—requirements.

However, GQA proposes grouping these circuits, effectively reducing the size of the KV Cache, which can grow enormously (in fact, it’s the most limiting memory factor in LLM engineering for large sequences).

See below an overview of attention to see what is actually going on inside ChatGPT and others:

An attention layer, generated by author

In the example above, with 8 attention heads and 4 groups, the KV Cache stores one key/value pair per group, so it grows by a factor of 4 instead of 8: a 2x reduction. In non-GQA attention, it would grow by a factor of 8 (one pair per head).
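To make the memory effect concrete, here is a minimal sketch of KV-cache sizing under GQA versus standard multi-head attention. The sequence length, head count, and layer count are illustrative figures of my own, not Nemotron's actual configuration.

```python
# KV cache sizing: per layer we cache two tensors (K and V), each of
# shape (seq_len, n_kv_heads, head_dim).
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, n_layers, bytes_per_value=2):
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

# Standard multi-head attention: one K/V pair per attention head (8 heads).
mha = kv_cache_bytes(seq_len=4096, n_kv_heads=8, head_dim=128, n_layers=32)

# GQA with 4 groups: heads within a group share one K/V pair.
gqa = kv_cache_bytes(seq_len=4096, n_kv_heads=4, head_dim=128, n_layers=32)

print(mha // gqa)  # 2: halving the number of K/V sets halves the cache
```

Note that only the cached K/V tensors shrink; the query heads, and thus the model's expressiveness, stay at 8.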

GQA is used by many powerful models today, including Meta's Llama and Alibaba's Qwen families, two of the strongest LLM lineages around. NVIDIA's adoption of the method indicates that it's becoming the norm.

Besides this noteworthy technique, the rest is standard: a decoder-only architecture, like all LLMs today.

Notably, they also halved its memory footprint by using FP8 precision, which means that each parameter in the model occupies 1 byte in High-Bandwidth Memory (HBM, the GPU's RAM) instead of the standard 2.

This effectively reduces the model size from 680 GB to just 340 GB, which means the model can easily be deployed on a single DGX H100 node (8 H100 GPUs), with roughly half of the node's memory left free for storage and compute.

In reality, despite the weights fitting in about half the node's memory, the model is distributed across all 8 GPUs to make the deployment more resilient to failures.
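The memory arithmetic above can be checked back-of-the-envelope. The DGX figure (8 GPUs with 80 GB of HBM each) is a public spec; the rest is simple byte-counting:

```python
# Memory math for a 340B-parameter model at different precisions.
params = 340e9

fp16_gb = params * 2 / 1e9  # 2 bytes per parameter -> 680 GB
fp8_gb = params * 1 / 1e9   # 1 byte per parameter  -> 340 GB

node_hbm_gb = 8 * 80  # one DGX H100 node: 8 GPUs x 80 GB = 640 GB

assert fp8_gb <= node_hbm_gb   # FP8 weights fit in a single node...
assert fp16_gb > node_hbm_gb   # ...while FP16 weights would not
print(node_hbm_gb - fp8_gb)    # 300.0 GB left for KV cache and activations
```

So FP8 is not just a cost optimization: it is what makes single-node deployment possible at all for a model this size.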

Deploying models on a single node is crucial, considering that AI GPU nodes consume roughly 10 times more energy than standard CPU server nodes.

What are the unit costs?

A DGX H100 (the state-of-the-art node everyone is buying) draws roughly 11 kW of power, i.e., 11 kWh of energy every hour.

At the average US industrial electricity tariff of $0.083/kWh, that's roughly $8k per year. Still much, much less than the capital cost of buying the node itself (around $160k).

At scale, however, this becomes millions of dollars per datacenter (a 20k-GPU cluster consumes around $28 million a year in energy), so every dollar counts.
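The per-node and cluster figures above can be reproduced with simple arithmetic. Note that the 1.4 PUE (datacenter cooling and facility overhead) is my own assumption to bridge the per-node cost to the cluster figure; it is not stated in the source.

```python
# Rough energy-cost math for DGX H100 nodes.
node_kw = 11      # power draw of one DGX H100 node, in kW
tariff = 0.083    # USD per kWh, average US industrial rate
hours = 24 * 365  # hours in a year

cost_per_node = node_kw * hours * tariff
print(round(cost_per_node))  # ~8,000 USD per node per year

nodes = 20_000 // 8  # a 20k-GPU cluster is 2,500 eight-GPU nodes
pue = 1.4            # datacenter overhead multiplier (assumed)
cluster_cost = cost_per_node * nodes * pue
print(round(cluster_cost / 1e6))  # ~28 (million USD per year)
```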

But I'm not writing about Nemotron today because of GQA and efficient deployment; synthetic data generation and weak supervision are the most exciting aspects of this project.

A Synthetic Data Generator

One of the primary use cases NVIDIA highlights is synthetic data generation, which offers significant advantages for developing high-quality training datasets.

This capability is critical because collecting human-annotated data is expensive and time-consuming, and enterprises are desperate for quality data as they mature their open-source developments.

Crucially, NVIDIA has demonstrated that Nemotron's data can be used to train other models by aligning it (shaping its behavior) almost entirely with data it generated itself: over 98% of the data used in the alignment process was synthetic.

In particular, NVIDIA crafted an ingenious data generation pipeline to train the models.

The pipeline proposes different topics, has generator models produce data on those topics, and then fine-tunes Nemotron on that data for each topic and task.
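The topic-to-data loop can be sketched as follows. This is a hypothetical simplification of the idea, not NVIDIA's actual pipeline code; `call_llm` is a placeholder for any generator-model call.

```python
# Hypothetical topic -> instruction -> response generation loop.
def call_llm(prompt: str) -> str:
    """Placeholder for a real generator-model call."""
    return f"<llm output for: {prompt}>"

def build_synthetic_dataset(topics, samples_per_topic=2):
    dataset = []
    for topic in topics:
        for _ in range(samples_per_topic):
            # Step 1: turn a topic into a concrete instruction.
            instruction = call_llm(f"Write a hard question about {topic}.")
            # Step 2: generate a candidate response to that instruction.
            response = call_llm(instruction)
            dataset.append({"instruction": instruction, "response": response})
    return dataset  # next: filter with a reward model, then fine-tune

data = build_synthetic_dataset(["chemistry", "contract law"])
print(len(data))  # 4
```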

One example of prompts they gave the model is shown below:

But NVIDIA went further.

Not only did they use synthetic data, but also weaker models to align stronger ones, a concept known as weak-to-strong generalization, popularized by OpenAI's superalignment team, co-led by Jan Leike (now at Anthropic).

Alignment refers to shaping a model's behavior, for instance preventing it from answering harmful requests like 'Help me build a bomb.'

Nevertheless, as we saw in the previous newsletter about jailbreaking LLMs, this is actually a very weak alignment that can be easily reversed.

The approach involves using an initial aligned model to generate synthetic data, which is then used to align a stronger base model.

This new base model, being stronger, generates even higher-quality synthetic data in subsequent iterations, creating a self-reinforcing cycle of improvement and proving that weaker models can shape the behavior of superior ones.
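This self-reinforcing cycle can be sketched conceptually as a loop where each aligned model becomes the teacher for the next, stronger base. The functions and model names below are stand-ins of my own for what are, in reality, full training runs:

```python
# Conceptual sketch of the weak-to-strong cycle; not actual training code.
def generate_data(teacher):
    return [f"synthetic sample from {teacher}"]

def align(base_model, data):
    return f"{base_model} (aligned)"

def weak_to_strong(initial_teacher, base_models):
    """Each round, the current aligned model teaches the next, stronger base."""
    teacher = initial_teacher
    for base in base_models:  # ordered weakest -> strongest
        data = generate_data(teacher)
        teacher = align(base, data)  # the student becomes the next teacher
    return teacher

print(weak_to_strong("nemotron-v0", ["nemotron-v1", "nemotron-v2"]))
# nemotron-v2 (aligned)
```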

Weak-to-strong generalization is considered potentially key for the day humans need to align models more powerful than ourselves.

Besides this exciting way of aligning models with synthetic data and using weaker models to improve better ones, NVIDIA also had time to create a specific reward model that provides reward feedback better than GPT-4o (and any other model).

Looking at the previous diagram, we need a reward model that chooses the better response of each pair and feeds that signal back to the model being trained.

The math is straightforward: the better our reward model, the better our trained model turns out. And Nemotron-4-340B Reward is best-in-class on the most popular reward-model benchmark.
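The pairwise-preference role the reward model plays can be illustrated in a few lines. The length heuristic below is a deliberately toy stand-in for a real scorer like Nemotron-4-340B Reward, which judges helpfulness, correctness, and more:

```python
# Toy illustration of pairwise selection with a reward model.
def reward_score(prompt: str, response: str) -> float:
    return float(len(response))  # toy heuristic; a real model scores quality

def pick_best(prompt, response_a, response_b):
    """Return the preferred response; this preference is the training signal."""
    if reward_score(prompt, response_a) >= reward_score(prompt, response_b):
        return response_a
    return response_b

best = pick_best("Explain GQA.", "Short answer.", "A longer, detailed answer.")
print(best)  # A longer, detailed answer.
```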

Not only that: despite being roughly six times smaller than GPT-4, it beats it, on average, in human-graded comparisons across many different tasks:

TheWhiteBox’s take

Well done, NVIDIA.

They have presented a family of models that runs efficiently on a single DGX node (just 8 GPUs) while having six times fewer parameters and occupying twelve times less memory than models like Claude 3 or GPT-4 (340 GB vs. roughly 4 TB), all while remaining extremely competitive with them.

It is also a great model for companies to create unique datasets to align their in-house models with bespoke synthetic data, a crucial next step for companies maturing their Generative AI efforts.

It has also been shown that weak-to-strong generalization works, a more cost-efficient fine-tuning approach and an encouraging signal for superalignment efforts.

Finally, I want to applaud NVIDIA’s openness, following the example of Apple and Meta. Fighting closed-source requires the support of big tech companies like these.

And in NVIDIA's particular case, it's also a matter of survival: the potential disappearance of open source would reduce the pool of potential GPU clients to fewer than six (all of which are building their own chips as we speak to loosen NVIDIA's grip on them), a horrifying prospect for the largest company in the world.

🕵🏽 Mixture-of-Agents Beats GPT-4o 🕵🏽

If you’ve followed me for a while now, you will have realized how excited I am about long-inference models. I believe—and academia does, too—that they are the next frontier for AI in terms of reasoning capabilities.

Now, a group of researchers at the start-up Together.ai (a company that serves LLMs) has published a paper on Mixture-of-Agents (MoA), showing how different LLMs can be combined to produce much better results.

And by better results I mean beating GPT-4o in highly popular benchmarks while using a set of very inferior models.

But how on Earth is that possible? That’s the power of long inference. Let’s dive in.

Iteration is the Key

For those unfamiliar with the term, long-inference Large Language Models (LLMs), instead of answering directly, are allowed to iterate and self-reflect, for a fixed time or until a quality threshold is met, to provide a much more 'thoughtful' answer.

As models are "given more time to think," they obtain much better results than the current generation, even when the base LLM is inferior. As you can see below, GPT-3.5, despite being notably worse than GPT-4 in raw comparisons, largely surpasses it when wrapped in agentic workflows.

Source: Andrew Ng

My bet for quite some time has been that the next generation will all be long inference models.

If you're wondering why this works so well, we aren't entirely sure, but most researchers, such as Yoshua Bengio, point to Daniel Kahneman's two modes of thinking: System 1 and System 2.

  • System 1 is fast, unconscious, and intuitive

  • System 2 is slow, conscious, and deliberate

When driving a car, you don't think about your movements; they are instinctive (System 1). When you work on a complex math problem, however, you and your prefrontal cortex are fully engaged in solving it (System 2).

Long story short, while current LLMs are System 1 thinkers, long inference models mimic System 2 thinking.

As a rough argument, current LLMs allocate exactly the same amount of per-prediction compute regardless of the complexity of the task.

Thus, with long inference, we aim to force the model to be ‘more thoughtful’ (predict more tokens) to solve complex tasks that require it.

In today's models, one way to approximate System 2 thinking is the famous Chain-of-Thought technique (asking the model to go step by step). Still, it's nowhere near the results of LLMs that actively search the solution space, as you'll see below.
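In practice, Chain-of-Thought is as simple as appending a cue to the prompt. The question and cue below are my own example, not from any specific paper:

```python
# A minimal chain-of-thought prompt versus a direct prompt.
question = "A train travels 60 km in 40 minutes. What is its speed in km/h?"

direct_prompt = question
cot_prompt = question + "\nLet's think step by step."
# The cue nudges the model to spell out intermediate steps
# (40 min = 2/3 h; 60 / (2/3) = 90 km/h) before the final answer.
print(cot_prompt)
```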

So, how do long-inference frameworks work?

A Collaborative Effort

As mentioned, academia has been doubling down on collaborative frameworks for LLMs to maximize results for quite some time.

A little over a year ago, MIT and Google researchers presented Society of Minds, a collaboration framework for LLMs.

Despite having much less powerful models available back then (they used Bard and GPT-3.5), these frameworks already showed some really interesting emergent behaviors, such as two LLMs that both start with wrong answers arriving at the correct one after debating:

Perhaps more seminal was the research on Tree-of-Thoughts (ToT), also by Google but in collaboration with Princeton University, which established that LLMs improve their results, especially in reasoning, when allowed to search the space of possible solutions:

For some complex tasks, ToT increased performance tenfold, a wake-up call for the industry that these researchers had made an interesting discovery.

A very recent piece of research from last week, Chain of Preference Optimization, uses ToT to generate good 'thought sequences' (depicted in green above) and fine-tunes a standard LLM with them. This gives very powerful results without the need to perform active search, but it's too early to draw meaningful conclusions.

By now, I assume you get the gist.

Long inference models maximize the likelihood of a correct answer by generating multiple plausible answers and refining them until they meet a certain quality threshold.
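The sample-refine-threshold idea can be captured in a short generic loop. This is my own sketch of the pattern, not any specific framework's API; `generate` and `score` are placeholders for an LLM and a quality judge:

```python
# Generic long-inference loop: sample candidates, keep the best,
# refine until a quality threshold is met or the budget runs out.
def long_inference(prompt, generate, score, n=4, threshold=0.9, max_rounds=3):
    candidates = [generate(prompt) for _ in range(n)]
    for _ in range(max_rounds):
        best = max(candidates, key=score)
        if score(best) >= threshold:
            return best  # good enough: stop spending compute
        # Otherwise refine: regenerate conditioned on the best draft so far.
        refine_prompt = f"{prompt}\nImprove this draft:\n{best}"
        candidates = [generate(refine_prompt) for _ in range(n)]
    return max(candidates, key=score)
```

The key design choice is that compute scales with task difficulty: easy prompts clear the threshold in one round, while hard ones earn extra refinement rounds.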

So, what does Mixture-of-Agents propose?

Many LLMs are Better than One

In a way, MoA is conceptually similar to Mixture-of-Experts (MoE).

In MoE, an LLM's layers are broken into 'experts', smaller sets of neurons that specialize, so only a fraction of the LLM's parameters activate for any given prediction.

Here, instead of breaking an LLM into parts, we build a 'grand LLM' made of smaller ones. As you can see below, we create 'LLM layers': layers packed with LLMs we call agents.

These agents are called proposers, because their job is to generate possible responses to the provided prompt.

Importantly, these layers are dense, meaning that the agents (LLMs) in the next layer receive all proposed answers from the previous layer. These LLMs still have to answer the initial question, but can use the previous layer's responses as deeper context.

In this scenario, an agent can decide to discard the previous generations and create a new response, or simply refine the previous layer's responses to improve their quality.

Finally, another LLM agent, called the aggregator, consolidates all accumulated information provided by the previous agents and builds the response the users receive.
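The proposer-layers-plus-aggregator topology can be sketched in a few lines. The `ask` function is a placeholder for an LLM call, and the model names are illustrative, not the paper's exact configuration:

```python
# Minimal sketch of the MoA topology: layers of proposers, one aggregator.
def ask(model, prompt):
    """Placeholder for a call to any LLM."""
    return f"[{model}] answer"

def mixture_of_agents(prompt, proposer_layers, aggregator):
    context = ""
    for layer in proposer_layers:  # each layer is a list of proposer models
        # Dense layers: every proposer sees the prompt plus ALL answers
        # from the previous layer.
        answers = [ask(m, prompt + context) for m in layer]
        context = "\nPrevious answers:\n" + "\n".join(answers)
    # The aggregator consolidates everything into the final response.
    return ask(aggregator, prompt + context)

final = mixture_of_agents(
    "What causes tides?",
    proposer_layers=[["llama-3-70b", "qwen2-72b"], ["mixtral-8x22b"]],
    aggregator="qwen2-72b",
)
print(final)  # [qwen2-72b] answer
```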

And, well, this self-refining process shows very impressive results.

A set of open-source models, all individually considerably inferior to GPT-4o, demolishes the latter in the AlpacaEval 2.0 benchmark (an instruction-following benchmark).

In other benchmarks like FLASK, a more fine-grained evaluation, MoA is extremely competitive with GPT-4o, making the solution state-of-the-art for LLMs today.

FLASK evaluation metrics

Fascinatingly, even though this framework generates far more tokens per prediction, it is more cost-efficient than frontier models, both in raw cost and in teraflops per forward pass relative to the accuracy achieved.

TheWhiteBox’s take:

With demand for GenAI products still more dream than reality, poor reasoning is widely seen as the main culprit. And long-inference models may soon change that.

However, what strikes me the most about this research is that, despite generating many more tokens on average, the overall solutions seem cheaper as the average model used in collaboration is smaller than frontier models.

As an MoA ensemble with fewer total parameters than GPT-4o still surpasses it, could long inference simply be a more efficient use of parameters and, contrary to what many believe, require less compute than individual-but-bigger models?

This would be the best of both worlds: better models with more efficient inference workloads. As we will see soon in the Leaders segment of this newsletter, we can't take this lightly; energy consumption from LLMs is a huge problem.

Therefore, opening the door for higher-performing yet cheaper solutions is exciting and, quite frankly, necessary for the industry’s survival.

🧐 Closing Thoughts 🧐

Compression. If there's one conclusion from this week's research, it's that we are getting better at building models.

Specifically, we are getting better results relative to model size, which suggests we are getting better at compressing world knowledge into smaller models.

Another point worth mentioning is how many companies are starting to release foundation models optimized for single tasks, such as NVIDIA's Nemotron for synthetic data generation. This follows Apple's example, which, acknowledging both the power and the limitations of foundation models, decided to fine-tune for specific tasks in Apple Intelligence.

Additionally, the AI market seems to be accelerating, with dozens of start-ups appearing weekly, signaling that venture capital seems more than willing to bet on AI companies… for now.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]