Mamba, the ChatGPT Killer & The Great Deflation
🏝 TheTechOasis 🏝
Breaking down the most advanced AI systems in the world to prepare you for your future.
5-minute weekly reads.
AI Research of the Week: Mamba, the ChatGPT Killer
Leaders: The Great Deflation
🎅 AI Research of the week 🎅
Two researchers have made the boldest move in years: throwing the biggest algorithmic breakthrough of the 21st century out the window.
Named Mamba, it achieves what was once thought impossible: matching or beating the Transformer, the architecture behind ChatGPT, while being faster and a lot cheaper.
Everyone seems to be talking about it, so let’s uncover what Mamba is.
The Gift that Keeps on Giving
Since its release in 2017, the Transformer architecture has become the ‘de facto’ choice for natural language modeling (models that generate text).
ChatGPT, Gemini, and Claude are all based on this seminal architecture, whose parallelizing capabilities, ideal for GPUs, sent recurrent networks, the previous state of the art, into oblivion.
The ubiquity of this architecture is such that the ‘T’ in ChatGPT stands for ‘Transformer’.
But what is the Transformer?
In very simple terms, it makes words ‘talk’ to uncover the relationships between them. This is known as the attention mechanism.
For instance, the attention mechanism over a pronoun will lead it to pay attention to its corresponding noun.
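The idea can be sketched in a few lines of plain NumPy. This is a stripped-down illustration, not a real Transformer layer: the learned query/key/value projections are omitted and the embeddings are random.

```python
import numpy as np

def attention(X):
    """Each word attends to every other word and becomes a weighted mix of them.

    X: (seq_len, d) matrix of word embeddings. The learned query/key/value
    projections of a real Transformer are omitted for clarity.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise word similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax: rows sum to 1
    return weights @ X                              # mix words by attention weight

X = np.random.randn(5, 8)   # 5 words, 8-dimensional embeddings
out = attention(X)
print(out.shape)            # (5, 8)
```

Note the `X @ X.T` term: it compares every word against every other word, which is exactly where the cost problem discussed next comes from.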
Works wonderfully, but at a huge cost.
Put simply, the Transformer yields impressive performance at the cost of being extremely inefficient.
To understand how models like ChatGPT process language, imagine you’re halfway reading a Harry Potter book.
However, to successfully read the next page, you have to store and review all the pages read previously in your mind.
Thus, to read the next chapter, you also have to remember every other detail read until now, be it relevant or not.
That’s how Transformers like ChatGPT process language.
Seems very inefficient, right? Yes, because it is.
Humans do remember many things read up to that point, but we forget irrelevant data like Hermione’s summer activities. They are not relevant to the storyline, so our mind discards them.
What I am trying to tell you is that Transformers don’t compress context, they store it as a whole.
Naturally, as context isn’t compressed, the computation and, above all, memory requirements of Transformers grow considerably as the text sequence grows larger.
Specifically, cost grows quadratically with sequence length in training and linearly in inference. In other words, doubling the sequence quadruples the training cost and doubles the inference cost.
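A back-of-the-envelope sketch of that scaling, counting only attention-score computations (a simplification that ignores constant factors):

```python
def training_cost(seq_len):
    # Training: every token attends to every other token -> n * n scores.
    return seq_len ** 2

def inference_cost(seq_len):
    # Generating the next token with a cache: it attends to all n previous tokens.
    return seq_len

# Doubling the sequence quadruples training cost but only doubles inference cost.
print(training_cost(2048) / training_cost(1024))    # 4.0
print(inference_cost(2048) / inference_cost(1024))  # 2.0
```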
So what do research labs do?
To avoid costs spiraling out of control, they limit the ‘workspace’ of the model to a context window, which is why ChatGPT can only process a limited amount of text at a time.
Over the years, many different architectures with more ‘efficient’ attention mechanisms have been proposed, but the loss of performance has prevented them from really substituting the vanilla Transformer.
So what did Mamba’s researchers do?
Into a Stateful World
The Mamba architecture stems from a critical question: Can we model language as effectively as the Transformer while being far more efficient?
And the answer is yes, thanks to what we define as ‘state’.
Circling back to the Harry Potter example, when we are reading a book, we keep an updated state; we slowly build a compressed understanding of what’s going on, keeping the key elements and rejecting the rest.
In essence, this is exactly what Mamba does.
Give me three ‘S’s
Just like the attention module sits at the core of the Transformer, at the core of Mamba sits the Selective State Space Model (Selective SSM).
An SSM is a rather new language modeling architecture inspired by state space models from the 1960s.
In simple terms, the model keeps a ‘continuous state’ that serves as context. If the current input is ‘Harry’, the SSM will use its state to infer that the next word is probably ‘Potter’.
However, ChatGPT can do that too, right?
Yes, but the key thing here is that Selective SSMs keep in memory only the context that matters, making them more efficient.
Precisely: if the sequence doubles, Mamba’s training cost doubles (a Transformer’s quadruples), while its inference cost remains constant no matter the length!
But how can Selective SSMs choose what context to keep and Transformers can’t?
To choose or not to choose, that is the question
What makes Mamba unique is that its SSM module is selective, i.e. chooses what context to keep and what to discard.
Like any other state space model, the model is driven by two equations:

h_t = A · h_{t-1} + B · x_t
y_t = C · h_t

With parameters A and B, we first obtain the state h from the input x, and then use C to compute the output y.
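As a sketch, that state update can be run in a few lines of NumPy; the parameters here are random toy values, not a trained model:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Run the two SSM equations over a sequence.

    h_t = A @ h_{t-1} + B @ x_t   (update the state with the new input)
    y_t = C @ h_t                 (read the output from the state)
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                  # recurrent: one step per input token
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
n, d = 4, 1                       # state size 4, scalar inputs
A = np.eye(n) * 0.9               # toy fixed parameters
B = rng.normal(size=(n, d))
C = rng.normal(size=(1, n))
xs = rng.normal(size=(6, d))      # a 6-step input sequence
print(ssm_scan(A, B, C, xs).shape)  # (6, 1)
```

The key point: the state h is a fixed-size vector no matter how long the sequence gets, which is precisely the ‘compressed context’ the Harry Potter analogy describes.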
These equations have existed for decades, but unlike all previous architectures, Mamba makes them input- and time-dependent.
As we want the model to compress context, it has to be capable of selecting what should be used as context and what not.
For instance, if the next word is ‘um’, that is probably not very relevant to context, and, thus, gets rejected.
To allow this phenomenon, Mamba introduces a new paradigm, where the weights of the model depend on the input and change over time.
In the Transformer, the weights are fixed, meaning that it can’t choose what to keep or not. Mamba, however, by making the weights a variable of the input, can adapt to every input.
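A toy illustration of input-dependent weights, assuming a simple gating parameterization (the projections W_B, W_C, w_g are hypothetical stand-ins for Mamba’s learned projections, not its exact formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm(xs, A, W_B, W_C, w_g):
    """Toy selective SSM: unlike a plain SSM, the update depends on each input.

    W_B, W_C, w_g are hypothetical learned projections; this sketches the
    idea of selectivity, not Mamba's actual parameterization.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        g = sigmoid(w_g @ x)        # per-input gate: is this token worth keeping?
        B = W_B @ x                 # input-dependent write into the state
        C = W_C @ x                 # input-dependent read from the state
        h = A @ h + g * B           # an irrelevant token (g ~ 0) barely changes h
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
n, d, T = 4, 3, 5
A = np.eye(n) * 0.9
xs = rng.normal(size=(T, d))
ys = selective_ssm(xs, A, rng.normal(size=(n, d)),
                   rng.normal(size=(n, d)), rng.normal(size=d))
print(ys.shape)  # (5,)
```

A filler token like ‘um’ would ideally drive its gate toward zero, leaving the compressed state untouched.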
Model-wise, this SSM module is inserted into a Mamba block, and stacked homogeneously to build the actual model.
But if the key was to simply make the weights input-dependent, why wasn’t this done in earlier State Space Models?
State Space Models are recurrent models, meaning that input is treated sequentially. Or, to put it another way, their computations aren’t parallelizable… unless you make the model linear time-invariant, or LTI.
As LTI models are linear and don’t change over time, we can apply a convolution to the two previous equations without having to materialize in memory the hidden state, reducing costs drastically and parallelizing the computation.
With the convolution, we ‘jump’ over the state-update step entirely.
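Concretely, for an LTI SSM the recurrence unrolls into y_t = Σ_k (C·A^k·B)·x_{t-k}, i.e. a convolution with a kernel that can be precomputed. A small NumPy check of that equivalence (toy sizes, random parameters):

```python
import numpy as np

def ssm_kernel(A, B, C, T):
    """Kernel K_k = C @ A^k @ B: the LTI recurrence unrolled into a convolution."""
    K, Ak = [], np.eye(A.shape[0])
    for _ in range(T):
        K.append((C @ Ak @ B)[0, 0])
        Ak = Ak @ A
    return np.array(K)

def ssm_conv(A, B, C, xs):
    """y_t = sum_k K_k * x_{t-k}: same outputs, no hidden state materialized."""
    K = ssm_kernel(A, B, C, len(xs))
    return np.array([K[:t + 1][::-1] @ xs[:t + 1] for t in range(len(xs))])

def ssm_scan(A, B, C, xs):
    """The sequential recurrence, for comparison."""
    h, ys = np.zeros(A.shape[0]), []
    for x in xs:
        h = A @ h + B[:, 0] * x
        ys.append((C @ h)[0])
    return np.array(ys)

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n)) * 0.3
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))
xs = rng.normal(size=8)

# Both paths produce the same outputs, but the convolution is parallelizable.
assert np.allclose(ssm_conv(A, B, C, xs), ssm_scan(A, B, C, xs))
```

This is exactly the trick input-dependent weights break, since the kernel can no longer be precomputed once A, B, C change at every step.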
And as a GPU’s performance is maximized when parallelizing computations, something Transformers can do, they became the default choice.
However, by making the weights of the model input-dependent, you lose the convolution capability. To solve this, Mamba is hardware-aware: it materializes the hidden state only in the GPU’s super-fast SRAM, keeping the rest of the parameters in the larger but slower HBM.
In other words, it drastically reduces the number of memory I/O events, which in turn means that, although the model is recurrent, it is still as efficient as a Transformer.
But the real question is: how does Mamba fare in comparison to Transformers?
Exciting, but Questions Remain
Tested at different small sizes (up to 7 billion parameters), Mamba beats every other model, including GPTs of similar sizes both in perplexity (a measure of how well a model predicts the next token) and accuracy.
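For reference, perplexity is the exponential of the average negative log-probability the model assigns to each correct next token; the probabilities below are made up purely for illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of each true next token).

    Lower is better: a model that always predicted the right token with
    probability 1 would score a perfect perplexity of 1.
    """
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Probabilities a (hypothetical) model assigned to each correct next token:
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.2, 0.1, 0.3, 0.25]
print(perplexity(confident))  # ~1.15, close to the perfect score of 1
print(perplexity(unsure))     # ~5.1, as if guessing among ~5 tokens each step
```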
It even manages to match the results of a Transformer more than two times its size.
Also, it shows no accuracy decrease with increased length, something unheard of until now:
However, although its preliminary results open the discussion that we might be seeing the end of Transformers and AI as we know it, Mamba has yet to be proven at big sizes.
But if these results extrapolate to LLM state-of-the-art sizes, we can confidently say this is the end of ChatGPT as we know it, and we could soon see the birth of ChatGPM, with ‘M’ for Mamba.
Mamba is poised to become the first architecture to put the Transformer’s throne at risk.
With its selective mechanism, it adapts to the input, maintaining a compressed state that makes it much more efficient than the Transformer while being just as performant.
👾 Best news of the week 👾
🧐 OpenAI’s GPT Store finally launches next week
🥇 Leaders 🥇
The Great Deflation
A year ago, 100% of economists in a very famous survey thought a recession was coming to the US.
And all of them were wrong.
A year later, the US economy looks as strong as ever, and inflation has fallen to 3.14% in November 2023, down from 7.11% the previous November.
This is good, as inflation eats our savings alive, but the greatest deflationary event of our lifetimes could be just around the corner: AI.
And that’s another story.
A force that will displace millions of workers, eliminate job openings, and commoditize several skilled jobs, to the despair of many, all in the name of progress.
Technology and Inflation
There are plenty of reasons to convince ourselves that this time it won’t be different.
As proven by many studies, including this one, every industrial revolution, every major ‘technological’ event, has caused deflationary pressures.
Prices are increasing…
Touted as the fourth industrial revolution, AI is largely seen as yet another deflationary event.
But there are reasons to believe otherwise, right?
LLM training is undoubtedly inflationary, cranking up the prices of GPUs across the world as big tech companies hoard this coveted hardware like there’s no tomorrow.
And if we consider that the GPU market is basically an oligopoly of three competitors, NVIDIA with almost 90% and AMD and Intel sharing the crumbs of the pie, with demand at historic highs, there’s no reason to believe prices will fall.
Adding to this, a lot of compute these days is focused not on the present, but on the future.
While we are still twiddling with GPT-4, months have gone past since OpenAI started working on GPT-5, or whatever they want to call it, requiring more and more compute as models get bigger.
And the biggest proof of how expensive this activity is lies in the investment rounds these companies keep closing, raising billions almost quarterly.
However, despite the obvious inflationary effects of AI today, this technology is going to be anything but inflationary in the not-so-distant future.
…But not for long
Whoever has worked on AI at the enterprise level for the last few months, knows that AI is largely going to be used to cut costs.
In a recent study by Bank of America and outlined by BlackRock, almost 6 out of 10 equity analysts argue that AI will be mainly used for cost savings.
Source: Bank of America
But we don’t even have to wait for 2024. There are three main ways this is already happening: productivity enhancements, market competition, and automation.
Firstly, AI significantly boosts productivity by automating tasks and processing data much faster than humans can. This increase in efficiency lowers the cost of production. As these savings are often passed down to consumers, prices tend to decrease, contributing to deflation.
Secondly, AI levels the playing field. In a study by Harvard researchers, AI dramatically leveled output quality among participants, increasing it by over 40% for less-skilled workers and by 17% for higher-skilled ones. It also levels the playing field among companies: firms with far fewer traditional resources become capable of much higher throughput, forcing larger companies to lower prices to stay competitive.
Lastly, AI-driven automation dramatically reduces labor costs and increases operational efficiency. AI has considerable synergies with technologies like RPA and rules-based chatbots, and huge implications for core software like ERPs, CRMs, and such. As big players in these markets embed GenAI into their products, these core systems will become more and more automated to require less maintenance, potentially pushing license costs down too.
But the biggest impact is going to be felt by, unsurprisingly, you.