
Evolutionary Model Merging, New Models without Training

🏝 TheTechOasis 🏝


In AI, learning is winning. While Thursday’s newsletter talks about the present, the Leaders segment on Sundays will make you aware of the future trends of AI with actionable insights to set you apart from the rest.

10-minute weekly reads.

🥇 This week on Leaders… 🥇

  • Leaders: Evolutionary Model Merging, New Models without Training

  • Did you know about… Google’s New Discovery, many-shot learning?

Welcome back to Leaders, the premium segment that sheds light on the future of AI and prepares you for it.

For today’s segment, what if I told you that you will create the AI models of the future?

Sounds hard to believe, but this is the promise Evolutionary Model Merging (EMM) intends to deliver.

A group of Japanese researchers has found a way to automate model merging (creating new models without actual training), opening the door to a world where everyone gets to create their models without relying on the ever-more powerful incumbents.

It allows the creation of unique models that often combine knowledge from highly unrelated domains, and it boosts the creation of models in underrepresented areas like non-English languages and minority cultures.

Consolidation And a Scary Sight

A quick overview of the scale increase that data and compute have seen over the past few years provides a stark image of how demanding AI is becoming and how embarrassingly far we are from meeting future requirements.

A Great Explosion

Back in 2022, Google Deepmind released a paper on measuring the optimal compute requirements for LLM training.

For an 8-billion parameter model, they recommended around 200 billion tokens of data, which amounts to around 150 billion training words.

Now, just two years later, Meta’s Llama 3-8B was trained on up to 15 trillion tokens, or roughly 11.25 trillion words.

That’s 75 times more training data.

Compute-wise, according to Andrej Karpathy, the model consumed around 2e24 FLOPs (that’s 24 zeros), which is one hundred times more FLOPs than what Chinchilla (the aforementioned paper) recommended.

FLOPs are floating-point operations, the total number of simple calculations the GPUs performed while training Llama. They are the most straightforward way of measuring how much compute an LLM training run consumes.
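To put those figures in perspective, here is a back-of-the-envelope sketch. It assumes the widely used C ≈ 6·N·D approximation for dense-transformer training FLOPs, which is only a heuristic; reported numbers (like Karpathy’s estimate above) can differ by a constant factor depending on what is counted.

```python
# Rough back-of-the-envelope math; assumes the common C ≈ 6 * N * D approximation
# for dense-transformer training FLOPs (a heuristic, not exact accounting).

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard rough estimate for dense transformers: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

N = 8e9                    # Llama 3-8B parameter count
chinchilla_tokens = 200e9  # ~200B tokens, the compute-optimal figure cited above
llama3_tokens = 15e12      # ~15T tokens actually used

print(f"Data ratio: {llama3_tokens / chinchilla_tokens:.0f}x more tokens")            # ~75x
print(f"Chinchilla-optimal compute (est.): {training_flops(N, chinchilla_tokens):.1e} FLOPs")
print(f"Llama 3-8B compute (rough est.):   {training_flops(N, llama3_tokens):.1e} FLOPs")
```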

While it’s worth noting that this number is slowly approaching the embarrassingly arbitrary yet still-cited 1×10²⁶ FLOPs threshold that Joe Biden’s AI Executive Order set, below which AI models can be trained without government oversight, it’s safe to say these numbers will be dwarfed shortly.

But why? Well, for two main reasons: the data effect and the compute paradigm shift.

Data and Thinking Modes

All Deep Learning today is based on imitation learning. Specifically, generative AI models are trained to replicate the training data. Thus, the higher quality, or ‘smarter’, the data is, the better the results.

And with the exponential increase of synthetic data, data quality will increase over time, as shown by models like Microsoft Phi-3, which are already at ChatGPT-3.5 level (the original ChatGPT) despite being in the 10-billion-parameter range, as we discussed on Thursday.

Yet scale also continues to increase due to scaling laws: as we train bigger models, we need more data.

On the other hand, we are currently transitioning to a new, more demanding model class. But what do I mean by that? 

Today, models are heavily skewed toward training FLOPs compared to inference FLOPs. In other words, models are meant to consume many more FLOPs during training than during inference (that is, when actually running the model).

But the tide seems to be changing toward inference in two directions:

  • agentic workflows

  • and search+gen.

As Andrew Ng has shown, LLMs wrapped in iterative workflows, processes that allow the models to reflect on their responses and give them ‘more time to think’, dramatically improve their response quality.

In fact, using these workflows, ChatGPT-3.5 beats zero-shot ChatGPT-4 (the standard, single-pass form) basically every time, independently of the specific framework used.

The intuition is that agentic workflows give models more compute per task, instead of forcing them to guess the correct answer in one shot, as we usually expect from LLMs when we interact with them.
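To make the idea concrete, here is a minimal sketch of a reflect-and-revise loop. `call_llm` is a hypothetical placeholder for whatever chat-completion client you use; the iterative structure, not the API, is the point.

```python
# Minimal reflect-and-revise agentic loop. `call_llm` is a hypothetical placeholder
# for an actual chat-completion client; only the loop structure matters here.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's chat-completion API."""
    raise NotImplementedError("plug in your provider's client here")

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    """Draft an answer, then critique and revise it `rounds` times before returning it."""
    draft = call_llm(f"Answer the following question:\n{question}")
    for _ in range(rounds):
        critique = call_llm(
            f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
            "List any concrete errors or omissions in the draft."
        )
        draft = call_llm(
            f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRewrite the answer, fixing the issues above."
        )
    return draft
```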

But ‘search+gen’ is even more interesting.

Akin to “System 2 thinking” based on the late Daniel Kahneman’s theory of thinking modes, it involves combining LLMs with search algorithms.

This has been hinted at by Demis Hassabis, CEO of Google Deepmind, as the quickest path to AGI, and highly rumored to be the secret sauce behind OpenAI’s heavily guarded new AI model, Q* (Q-star).

With examples like Google’s AlphaCode 2 pointing in this direction, the idea is to combine the LLM with search algorithms like A* or Monte-Carlo Tree Search so that, instead of immediately responding to your request, the model ‘explores the realm of possible answers’ until it finds the best possible response.

Similar to a human exploring different ways to solve a complex math problem, we give models ‘time to think’, which dramatically improves their understanding and reasoning capabilities.
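As a deliberately simplified illustration of the ‘search+gen’ idea (real systems like AlphaCode 2 sample, filter, and score candidates at far larger scale, and tree-search variants go much further), a best-of-N loop over candidate answers might look like this; `call_llm` and `score_answer` are hypothetical stand-ins.

```python
# Deliberately simplified 'search + generation' sketch: sample many candidate answers
# and keep the one a verifier scores highest. Real systems use far richer search
# (e.g. tree search); `call_llm` and `score_answer` are hypothetical stand-ins.

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical sampling call to an LLM (higher temperature -> more diverse answers)."""
    raise NotImplementedError

def score_answer(question: str, answer: str) -> float:
    """Hypothetical verifier / reward model estimating an answer's quality."""
    raise NotImplementedError

def search_then_answer(question: str, n_candidates: int = 16) -> str:
    """Explore the space of possible answers, then return the highest-scoring one."""
    candidates = [call_llm(question) for _ in range(n_candidates)]
    return max(candidates, key=lambda ans: score_answer(question, ans))
```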

Besides examples like AlphaCode 2 or Deepmind’s AlphaGeometry, the latter reaching math-olympiad gold-medalist-level performance on geometry problems, research by OpenAI here and here clearly indicates they are also going down this path.

This trend toward giving models more time to think during inference is defined as the ‘long inference’ paradigm, as models will allocate much more compute (or thinking time) to answering than to learning.

In fact, it’s rumored that Microsoft’s $100 billion datacenter project, Stargate, is based on this assumption too.

But why am I telling you all this? Simple: because both approaches will require burning money at full throttle in order to be executed.

A 3-Player Game

Simply put, with current energy availability, the extremely demanding AI training and inference pipelines will completely devour our energy grids.

In fact, Sam Altman, OpenAI’s CEO, has said time and time again that energy scarcity is by far OpenAI’s greatest fear.

And he is putting his money where his mouth is, personally investing in nuclear fusion company Helion and in portable nuclear reactors through Oklo.

As electricity demand skyrockets, the cash needed to fund AI training initiatives will also increase heavily. If some of these start-ups are already raising billions every year and still need to raise more, this trend will only get worse.

Consequently, it will be of no surprise to you that this trend will only incentivize extreme consolidation among those who can afford training or inference.

In other words, if we want to prevent ‘AI creation’ from becoming a game reserved for the extremely rich, we need to find ways to make model creation cheaper.

And here is where Evolutionary Model Merging comes in.

Making the AI Pipelines Smarter

Model merging has been a really hot topic in the AI space for quite some time now, and its influence is only going to explode thanks to its amazing benefits.

Among the most common approaches, we have hybrid architectures, with examples like Jamba, which combines different primitives such as the attention mechanism and Mamba to create powerful models that are also far cheaper to run.

[Figure: a Jamba block combining different primitives]
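For illustration only (this is not Jamba’s actual code), a hybrid stack that interleaves attention blocks with state-space-style blocks could be sketched like this. `SSMBlock` is a crude recurrent placeholder standing in for a real Mamba block, and the interleaving ratio is just an example; the real Jamba also mixes in MoE layers.

```python
# Illustrative sketch of a hybrid attention + state-space stack (NOT Jamba's real code).
# SSMBlock is a crude placeholder standing in for an actual Mamba block.

import torch.nn as nn

class SSMBlock(nn.Module):
    """Placeholder for a Mamba/state-space block; a GRU is used only as a stand-in."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.mixer(x)
        return out

class HybridStack(nn.Module):
    """Interleaves attention blocks with SSM-style blocks, Jamba-style."""
    def __init__(self, d_model: int, n_layers: int, n_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            if i % 4 == 0 else SSMBlock(d_model)   # e.g. one attention block every few SSM blocks
            for i in range(n_layers)
        )

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        for layer in self.layers:
            x = layer(x)
        return x
```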

But a group of researchers in Japan decided to take this to the next level.

Mixing Weights & Layers

There are two ways you can merge models:

  1. Across parameter space

  2. Across layer space

The former mixes the neurons (weights) of the models, choosing wisely which to keep from each one. The latter combines layers from the different models, similar in spirit to Jamba but at a more granular, layer-by-layer level rather than simply combining blocks.
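For intuition, here is a minimal sketch of the two axes, assuming two checkpoints that share the same architecture. Real methods (TIES merging, slerp, and friends) are far more careful about which parameters survive; the helper names here are purely illustrative.

```python
# Minimal sketch of the two merging axes for two same-architecture checkpoints.
# State dicts map parameter names to weight tensors (e.g. PyTorch state_dicts).
# Real methods (e.g. TIES merging) are far smarter about resolving conflicts.

def merge_parameter_space(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Parameter-space merge: interpolate every matching weight tensor of the two models."""
    return {name: alpha * state_a[name] + (1 - alpha) * state_b[name] for name in state_a}

def merge_layer_space(layers_a: list, layers_b: list, recipe: list) -> list:
    """Layer-space merge: assemble a new stack by taking whole layers from either model.
    `recipe` is e.g. ['a', 'a', 'b', 'a', ...], choosing the source model per position."""
    return [layers_a[i] if source == "a" else layers_b[i] for i, source in enumerate(recipe)]
```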

Sadly, both methods are very hard to implement for two reasons: parameter compensation and knowledge path-breaking.

When we think about Large Language Models (LLMs), they are essentially a bunch of neurons arranged across different layers.

Intuitively, they are a compression of the training data, meaning they capture the key patterns that allow them to regenerate the learned data word by word.

Fascinatingly, LLMs do not hold any external database from which to retrieve data (at least not in their standard form; RAG is another story). This means that the neurons inside the model are the elements actually ‘storing that knowledge’.

What’s more, knowledge is not evenly distributed, as each neuron will ‘specialize’ in several topics. Crucially, neurons will combine to elicit that knowledge, meaning that to generate the ‘next word’, a combination of thousands or millions of them is required.

This makes model merging very complicated, as:

  • At the parameter level, to merge models you have to discard neurons. Therefore, some knowledge will be lost.

  • And at the layer level, as you discard some layers in the model, you may be ‘breaking’ the knowledge paths (the specific neuron combinations) that allow the model to accurately represent a given domain.

So, what does EMM bring to the table to solve this?

Automated TIES and Layer Merging

EMM can be best summarized with the following graph:

Unlike previous approaches, it combines both methods, parameter-space and layer-space merging, in a uniquely smart, automated way that has yielded absolutely astonishing results.
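As a toy illustration of the general idea only (the actual paper relies on established evolutionary optimizers and much richer parameter- and layer-space operations), an evolutionary loop over a merging ‘recipe’ could look like this; `evaluate`, `N_LAYERS`, and the mutation scheme are all assumptions for the sketch.

```python
# Toy evolutionary loop over a merging 'recipe' (per-layer interpolation weights).
# Purely illustrative; the real EMM work uses established evolutionary optimizers
# and richer merge operations across both parameter and layer space.

import random

N_LAYERS = 32          # number of layers whose mixing ratio we evolve (assumption)

def evaluate(recipe: list) -> float:
    """Hypothetical fitness: merge the source models with `recipe`, then score the
    merged model on a small benchmark. Higher is better."""
    raise NotImplementedError

def mutate(recipe: list, sigma: float = 0.1) -> list:
    """Perturb each per-layer mixing coefficient, keeping it inside [0, 1]."""
    return [min(1.0, max(0.0, w + random.gauss(0.0, sigma))) for w in recipe]

def evolve(generations: int = 50, population: int = 16) -> list:
    """Simple keep-the-best-and-mutate loop: select top recipes, mutate them, repeat."""
    pool = [[random.random() for _ in range(N_LAYERS)] for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pool, key=evaluate, reverse=True)
        parents = scored[: population // 4]                    # survivors
        pool = parents + [mutate(random.choice(parents)) for _ in range(population - len(parents))]
    return max(pool, key=evaluate)
```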

Subscribe to Leaders to read the rest.
