An AI Miracle & ChatGPT is Bullshit

🏝 TheTechOasis 🏝

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

🚨 Week’s Update 🚨

Welcome back! This week, we have news from Figma, OpenAI, AI-generated ads, and an exciting way to kill hallucinations.

Figma has launched a series of AI features (video included) that help users with prototyping.

I wanted to share this because Figma is one of those great examples on how to survive AI. Instead of AI being the product, AI makes the product better.

It’s business models like these that will outlive the AI hype; the others, well, won’t.

Toys “R” Us has launched an entirely AI-generated ad telling the life story of its founder, Charles Lazarus.

The scenes are so obviously AI-generated that the result is very creepy, serving as a stark reminder that:

  1. humans are still needed in the loop, and

  2. video models don’t understand the world.

I already shared with Premium members that Etched has raised $120 million from investors such as Peter Thiel to create the first Transformer ASIC. In layman’s terms, they are creating the first-ever chip specifically designed to run Transformer architectures (like ChatGPT’s), and nothing more.

All frontier AI models today, ranging from ChatGPT to autonomous driving vehicles, are based on the Transformer.

That said, creating an Application-Specific Integrated Circuit (ASIC) is a bold move: if, for whatever reason, the Transformer falls out of favor, this company is dead (something the CEO himself acknowledges).

Imbue, a company that has raised hundreds of millions of dollars to create agents that can reason and code, has released a 70B model that outperforms GPT-4o zero-shot (giving the model no in-context examples) on a series of benchmarks, a very impressive feat.

It’s an interesting finding because the key to this performance seems to lie in the training procedure (which they have fully open-sourced).

Moreover, it signals how obsessively the industry keeps going larger while there’s still so much room for improvement at smaller sizes.

Moving on, OpenAI has delayed its much-anticipated ‘voice mode’ from June to at least the fall of this year, adding to the company’s long list of delays.

I’m a ChatGPT user, let me be clear.

That said, I much prefer Anthropic’s approach to releasing products: they announce and release at the same time. OpenAI, in pure Silicon Valley style, promises the world in exchange for continuous delays.

Lamini, a start-up enterprise LLM platform, has released a research paper that claims it can reduce hallucinations by up to 95% through a new type of fine-tuning.

If this is true, this is HUGE.

If nonexistent demand is the main reason AI is in a bubble, hallucinations are the reason demand is faltering. If this method really reduces hallucinations by up to 95%, it will be a siren song for companies.

Finally, in the latest round of ‘please don’t make this real’, Google is working on something Meta already tried: letting you chat with AI versions of your favorite celebrities or even YouTubers.

I hate this with a passion.

According to a recent study, our youngest generations are so unaccustomed to real interactions that they couldn’t even sustain eye contact with the interviewer during job interviews. In fact, the study concluded that some companies are avoiding hiring this generation altogether.

In the meantime, money-hungry corporations that monetize attention are preying on our kids’ tribal desire for connection by creating easy-to-anthropomorphize AIs of their favorite celebrities, doing nothing but worsening those kids’ chances of thriving in the real world.

I don’t mind a 30-year-old talking to chatbots; our kids, however, should be on the street talking to other kids, not AIs.

🧐 You Should Pay Attention to 🧐

  • ChatGPT is Bullshit

  • SAMBA, An AI Miracle

🤯 ChatGPT is Bullshit 🤯

A paper appropriately named ‘ChatGPT is Bullshit’ has been making waves in the industry for one single reason:

They claim ChatGPT is a bullshit machine.

I apologize for the extensive use of the word ‘bullshit’ in this piece; it’s not me, it’s the literal word researchers used.

But what do they mean by that?

The True Nature of AI

Whenever a Large Language Model (LLM) gets something wrong, we say the model has ‘hallucinated.’

This is because, as LLMs are stochastic (pseudo-random) word generators, there’s always a non-zero chance that the model outputs something unexpected that deviates from the truth.

And let me be clear: this is done on purpose.

As there are many ways to express the same thought or feeling in natural language, we train our models to model uncertainty. To do so, we don’t make them commit to an exact word for each new prediction; instead, we force them to output a probability distribution over their entire vocabulary.

In other words, as seen below, the model ranks the words it knows (its vocabulary) according to how statistically reasonable they are as a continuation to the input sequence.

However, counterintuitively, we don’t always choose the most probable word. In fact, we randomly sample one of the top-k words, as all are probably reasonable continuations (in the image above, all 5 options are semantically valid).

This is done to enhance the model’s creative capacity, which is sometimes desirable and is thought to enhance the model’s language modeling prowess.

LLMs include a hyperparameter, named ‘temperature’, that allows you to control how ‘creative’ you want the model to be.
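To make this concrete, here’s a minimal sketch of that sampling process. The logit values are made up for illustration (a real model scores tens of thousands of vocabulary tokens), but the mechanics, temperature scaling, softmax, and top-k sampling, are the standard ones:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, k=5):
    # Scale raw scores by temperature: lower values sharpen the
    # distribution (safer picks), higher values flatten it
    # (more 'creative' picks).
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Softmax: turn the scaled scores into a probability distribution.
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep only the k most probable words, renormalize, and sample.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    z = sum(p for _, p in top)
    return random.choices([t for t, _ in top], [p / z for _, p in top])[0]

# Made-up logits for continuations of "The cat sat on the ..."
logits = {"mat": 3.1, "rug": 2.7, "sofa": 2.5, "floor": 2.2, "moon": -1.0}
print(sample_next_token(logits, temperature=0.8, k=4))
```

Run it a few times and you’ll see different (but always plausible) words come out; crank the temperature down toward zero and the output collapses onto the single most probable word.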

But whenever the model gets this process wrong and outputs some outlandish claim, is it really ‘hallucinating’ the way humans do?

Anthropomorphizing Robots

Researchers say that this is blatantly wrong.

A hallucination implies an incorrect perception of reality that makes someone generate statements that are not grounded in reality. But that’s the thing: LLMs aren’t capable of perceiving reality.

They see the world through the lens of text, which prevents them from truly experiencing reality.

This line of thinking would suggest that our recently discussed ‘Platonic AI’ isn’t entirely accurate either (or is at least incomplete), as models lack the perceptive capacity to observe reality: they observe a human-generated representation of reality (text and images), which isn’t reality itself.

So while models may be converging, they still need to be endowed with the capacity to experience reality.

For that reason, calling it ‘hallucination’ does more harm than good. But why not just call it lies?

Understanding ChatGPT’s goal

Researchers state that saying that ‘ChatGPT lied’ misrepresents the true nature of LLMs. To lie, someone has to be aware of the truth about something and choose to give an alternative inaccurate statement on purpose.

This is NOT what ChatGPT does.

In fact, the team argues that the model can’t possibly be aware of truths and lies because it’s not really trying to tell the truth; it’s simply imitating human language.

For that reason, ‘bullshitting’, or spreading inaccurate statements without being aware of their inaccuracy, is a term that fits LLMs much better.

But why?

Insofar as the model ‘speaks the truth,’ it is only as accurate as the truthfulness of its training data. The model doesn’t evaluate the truthfulness of each word and statement; rather, it generates responses based on statistical patterns and probabilities, independently of their truth or falsehood.

In other words, if two generations are equally statistically reasonable to ChatGPT but one is true and the other is false, the model really doesn’t care which one it outputs, as both meet its goal of reasonably imitating human language.

TheWhiteBox’s take

The question of whether LLMs understand meaning remains unanswered, as Brown researchers showed.

Of course, one can argue that as our models ingest better-quality data and improve their compression capability, ‘true’ statements will be more statistically reasonable to the model than ‘false’ statements.

However, as long as the models aren’t capable of seeking the truth (because they are unaware of its existence), underrepresented truths in the training data will tend to induce the model to ‘hallucinate’ or, to be more precise, ‘bullshit its way’ into a false answer.

💃🏼 SAMBA, An AI Miracle 💃🏼

Recently, Microsoft’s AI game can be summarized in one way: Small Language Models, or SLMs. In other words, they are obsessed with building powerful AI models that are also cheap to run.

Now, they have announced SAMBA. And let me tell you, based on its bold promises, this one is hard to believe.

They seem to have found the perfect hybrid architecture, one that is smarter and faster than all models of its size and, importantly, defeats one of frontier AI’s biggest issues.

Fascinatingly, with Samba, we aren’t only diving into an exciting new architecture but also gaining a deep intuition about how frontier AI models work.

The Neverending Problem

Everything in frontier AI today is a Transformer. LLMs, SLMs, video models like Sora, image generators, and even autonomous driving systems like Tesla’s FSD software.

Everything revolves around one seminal architecture. But why?

Mixers Just Too Good

Although I recommend reading my blog on Transformers for a full understanding, they combine two components:

  1. Token mixer: Known as the attention mechanism, token mixers help words in a sequence interact and update their meaning concerning other words in the sequence, like ‘bat’ updating its meaning with ‘baseball’ to know it’s not referring to the animal.

  2. Channel mixer: Known as MLPs or Feedforward layers (FFNs), they provide core knowledge from the network to words in the sequence, like having the token ‘Michael Jordan’ and adding the concept of ‘basketball’ to it so it becomes the legendary player and not the Hollywood actor.

MLPs are also essential for introducing non-linearities (allowing the network to learn non-linear relationships). If you want to fully understand why they work, read here.

Importantly, Transformers have two great powers:

  1. Global attention: The attention mechanism ensures that any long-range dependency will be identified, making Transformers excellent fact retrievers.

If they are reading Chapter 10 of a book and a critical piece of information was mentioned way back in Chapter 1, the Transformer will pick up on it.

  2. Hyper-parallelization: In Transformers, everything is parallelized, making them a perfect fit for parallel computation hardware, aka GPUs, and ideal for large-scale training.

In summary, the Transformer is an architecture that can detect almost any pattern in data and endure huge amounts of training.

Thus, when deployed at scale, they become excellent data compressors: models that can ingest amounts of data much larger than their actual size and replicate them generatively, thanks to their capacity to find the hidden patterns in language (like grammar).

Sadly, however, they are extremely inefficient and, importantly, have one huge vulnerability.

Cost And Extrapolation

For starters, they have quadratic computational and memory requirements.

  • If the sequence doubles, the compute and memory required quadruple.

  • If the sequence triples in size, both increase ninefold.

The reason for this is that Transformers can’t compress memory. They have to see the entire sequence for every single prediction, as if you had to reread all previous chapters of a book every time you wanted to read the next word.

As they can’t have a ‘summary’ of what they read in the past, they simply reread it every time (literally). And although we have ways to make this process less redundant (the KV Cache), you will agree with me that it’s extremely inefficient.
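The quadratic growth is easy to verify with a back-of-the-envelope sketch (the per-pair cost here is an arbitrary placeholder; real attention cost depends on model dimensions too):

```python
def attention_cost(seq_len, per_pair_flops=1):
    # Global attention scores every token against every other token,
    # so cost grows with the square of the sequence length.
    return per_pair_flops * seq_len * seq_len

base = attention_cost(4_000)
print(attention_cost(8_000) // base)   # doubling the sequence -> 4x the cost
print(attention_cost(12_000) // base)  # tripling the sequence -> 9x the cost
```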

To see how expensive to train and run Transformer-based LLMs are, read my article on energy constraints for a real-life example and insights.

Additionally, they suffer from the extrapolation illness.

When you measure a model’s quality, you measure its perplexity, which captures how uncertain it is about the current prediction. The higher the perplexity, the more unsure the model is, so we aim for lower values.
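For intuition, perplexity can be computed from the probabilities the model assigned to the correct tokens. A toy sketch with made-up probabilities:

```python
import math

def perplexity(correct_token_probs):
    # Perplexity = exp(average negative log-likelihood of the correct
    # tokens). A perplexity of N roughly means the model is as unsure
    # as if it were choosing uniformly among N words.
    nll = -sum(math.log(p) for p in correct_token_probs)
    return math.exp(nll / len(correct_token_probs))

confident = perplexity([0.9, 0.8, 0.95])  # model was sure -> low value
unsure = perplexity([0.2, 0.1, 0.3])      # model was guessing -> high value
print(confident < unsure)  # True
```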

When training language models, you decide the length (or maximum length) of the training sequences.

Ideally, you would want to train the model on smaller sequences to decrease costs (for the reasons explained earlier), then let the model ingest sequences of any size at inference.

Sadly, this is not possible, as Transformers with global attention can’t extrapolate to longer sequences. In layman’s terms, if you send the model longer-than-training sequences at inference, the perplexity skyrockets (the model can’t decide with any certainty what the next word should be, making it useless).

Consequently, researchers cap the maximum sequence length users can send the model.

This also forces them to train the model on extremely long sequences, making the process extremely costly, sometimes multiple hundreds of millions of dollars.

According to Dario Amodei, Anthropic’s CEO, that figure will surpass $1 billion by 2025 and $100 billion by 2027… for a single model.

For those reasons, for years, researchers have tried to find better alternatives to attention or interesting ways to combine it with other architectures.

And while the first was a complete failure, the second one has just seen its biggest breakthrough: Samba.

Dancing Costs Away

Amongst the many ‘Transformer challengers’, few have received more hype than Mamba, a Selective State Space Model.

Efficient, but…

In succinct terms, unlike Transformers, Mamba carries a recurrent, compressed state of the past in a similar fashion to humans (we don’t remember every single detail of our past experiences, just the key points), giving us the diagram below:

What does each term represent?

xt represents the input, ht the recurrent state, Δ the selective mechanism, and A, B, and C are matrices that decide whether the recurrent state or the new input is more relevant to the new output.

It looks fairly daunting, but it’s not. In simple terms, it's a double-gated mechanism that asks one question for every new input:

Is this input relevant?

  1. If so, it updates the state's value with it and assigns it a lot of weight to generate the next output.

  2. Otherwise, it ignores the input and relies on the recurrent state to generate the next output.

The simplest example is filler words like ‘um’, typical of spoken text, which the model ignores because they provide no value for predicting the next token.

Crucially, as Mamba holds a fixed memory, its memory requirements on inference are fixed no matter how long the sequence is, and the computational costs increase linearly (unlike Transformers, where both increase quadratically).
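Here’s a deliberately simplified caricature of that selective update. In a real Mamba, the gate is learned from the input itself; here `relevance` is hand-set purely for illustration:

```python
def selective_update(state, x, relevance):
    # A gate ('relevance', between 0 and 1) decides how much of the new
    # input is written into the fixed-size recurrent state. Irrelevant
    # inputs leave the state untouched.
    return [(1 - relevance) * h + relevance * xi for h, xi in zip(state, x)]

state = [0.0, 0.0]
state = selective_update(state, [1.0, 2.0], relevance=0.9)  # relevant: stored
state = selective_update(state, [5.0, 5.0], relevance=0.0)  # filler: ignored
print(state)  # the filler input left no trace in memory
```

Notice that however long the sequence gets, the state stays the same size, which is exactly why Mamba’s inference memory is constant.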

In other words, Mamba is much more efficient to run than a Transformer. However, these architectures struggle to model non-Markovian dependencies.

A Markov process defines the new state as determined solely by the current state and the latest input, not by states further in the past, which is precisely how Mamba works.

Thus, a non-Markovian dependency could be an important fact that the model did not pick up on and add to its compressed memory, so the model essentially forgot it.

For that reason, Mamba fails to retrieve some information from the past. So, if Mamba alone can’t defeat Transformers… why not just mix both?

Just Beautiful

In simple terms, Samba is a mixture between Transformers and Mamba. But why use this specific combination?

In the paper, researchers argue that each component plays a key and specific role:

  • Mamba: The backbone for efficient decoding; its recurrent state allows the model to capture the sequence’s overall, relevant semantics.

  • SWA: Sliding-window attention, the attention mechanism that allows for fact retrieval from older parts of the sequence.

Unlike standard attention, SWA slides the attention window across the sequence, meaning the model can only perform attention over the last ‘x’ tokens. If the model is looking at word number 4,500 and has a 4,000-token window, it can only ‘see’ words between positions 500 and 4,500.

Thus, the window slides across the sequence, reducing compute and memory requirements.

  • MLP: MLPs are in charge of recalling factual knowledge. They embed core knowledge learned during training into the process so that the model can output words or concepts that are not explicitly mentioned in the input sequence but are part of its knowledge.

Disclaimer: Fact retrieval and fact recalling are different things. The former (attention) refers to retrieving data from the actual input sequence. The latter (MLPs) refers to recalling facts not present in the input sequence but part of the model’s core knowledge.

For more detail on why MLPs act as ‘fact finders’, read here.
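The sliding-window idea from the SWA bullet above can be sketched in a few lines (the function name is mine, purely illustrative):

```python
def swa_visible_positions(query_pos, window=4_000):
    # Under sliding-window attention, a token at position `query_pos`
    # can only attend to the last `window` positions (itself included).
    start = max(0, query_pos - window)
    return range(start, query_pos + 1)

visible = swa_visible_positions(4_500, window=4_000)
print(visible.start, visible.stop - 1)  # 500 4500
```

Because the window size is fixed, the per-token attention cost no longer grows with the full sequence length, which is where the savings come from.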

Consequently, you get the best of all worlds:

  • Highly efficient decoding based on Mamba,

  • with attention retrieving facts from the sequence that the Mamba layers might have missed (as mentioned earlier),

  • and MLPs embedding model knowledge into the mix.

Role of every Samba component. Source: Author

Now, this is all very interesting, but show me proof this is the real deal, right?

A New Small State-of-the-Art

The most important aspect of SAMBA is that it doesn’t suffer from the extrapolation problems we discussed earlier for Transformer-only models.

For instance, while only trained on 4k-token sequences (around 3k words), the model performed well (even reducing perplexity) on sequences of up to 1 million tokens (roughly 750k words).

In other words, unlike with Transformer-only models, you can send Samba sequences of up to 750,000 words and the model works fine, even though it was trained on sequences of just up to 3,000 words.

This savagely reduces training costs and makes the model more useful on inference.

For instance, if we look at models with global attention, like the LLaMa 3 models (red and green), their perplexity skyrockets once the extrapolation problem kicks in (at the 16k range), making them unusable for long sequences.

The pattern repeats throughput-wise, with SAMBA achieving a performance of almost 500 tokens/second even when the sequence is around 100k words long.

You may have noticed that Mistral-1.6B seems to hold its own despite being ‘Transformer-only’.

The fact that Mistral’s models also use sliding-window attention signals that this attention method seems superior to vanilla attention (ChatGPT/Gemini) and Grouped-Query Attention (LLaMa or Nemotron), at least for preventing the extrapolation illness.

That said, Samba is still superior: when performing retrieval exercises over very long sequences, it considerably outperforms Mistral-1.6B.

In a more structured comparison, Samba also displays better perplexity results in almost every test configuration against models of similar size.

Moreover, it’s also superior to the Phi family, the SLMs being used, among other cases, as the on-device language models for Microsoft’s new Copilot+PCs:

On a final note, although not yet released, they have a 3.8-billion-parameter model that outperforms all sub-10-billion models worldwide. This suggests they might have finally found something researchers have been seeking for years:

An alternative architecture worth scaling.

TheWhiteBox’s take:

First and foremost, I like that Microsoft is publishing this very open research; every new open-source, powerful model must be celebrated.

We have also learned a whole lot about how these architectures work, as we gain intuition about the role each component plays in the overall process.

Particularly interesting is the idea that using standard attention, the go-to option for almost seven years, is detrimental to performance in this case.

At least to my knowledge, this is the first time that not using standard attention wasn’t only about saving costs but about actually protecting the model’s performance, to the point that they might have figured out a way to avoid the extrapolation illness for good.

🧐 Closing Thoughts 🧐

AI researchers continue to redefine what’s possible in terms of efficiency, creating powerful models that compete with the state of the art despite being much smaller.

However, it seems that many people in the industry are growing tired and skeptical of LLMs, arguing that too much focus is put into them, and their promises of outsized performance are not based on facts.

Interestingly, it seems that open-source and hybrid architectures are becoming a very interesting approach for enterprises, as they

  1. offer fine-tuning and full control of the LLMs, and

  2. offer higher cost-efficiency; at this point, choosing to go with the ‘big guys’ is, in my view, a decision that should be severely scrutinized.

Long story short, open-source can’t be ignored anymore. And if markets realize this, well, in what position does that leave the companies that have been growing in the trillions over the last few months and years? 

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]