The Anti-LLM Revolution Begins

If you lift your head above the media funnel of AI outlets and influencers that simply echo Sam Altman’s thoughts every time he speaks, you will realize that, despite the recent release of OpenAI’s new o1 models, sentiment against Large Language Models (LLMs) is at an all-time high.

The reason?

Despite the alleged increase in ‘intelligence’ that o1 models represent, they still suffer from the same issues previous generations had. In crucial aspects, we have made no progress in the last six years, despite all the hype.

With all AI money flowing directly into LLMs, much of it is staked on these models being the ‘path to AGI.’ Hence, the implications of this not being true would be disastrous, especially for markets.

Today, based on weeks of gathering the most compelling evidence possible, I present to you the most extensive review of LLM limitations, backed by fresh research from Apple, Meta, Google, and Princeton University, among many others, all making the same claim: we are being deceived.

Today, I will open your eyes and give you an alternative, fact-based view of LLMs, the supposed golden ticket to AGI, which might not look so golden to you in a few minutes.

In particular, I’ll provide you with tough-to-find insights, such as:

  • Compelling evidence that LLM intelligence is primitive at best (I guarantee some of the results will make you sick of all the hype around LLMs).

  • Proof that you are framing LLM intelligence incorrectly, and that a new evaluation framework for frontier AI, the task familiarity/complexity conundrum (or, as I like to call it, the ‘bullshitometer’), is the way to go, illustrating how incumbents are tricking you into believing LLMs are smarter than they really are.

  • Playing devil’s advocate, I’ll also give you the other side of the coin, illustrating how LLMs could still play a significant role in AGI.

  • And we will end with a reflection on how ‘fucked up’ this industry’s judgment is and how you should approach AI news from now on.

Let’s dive in.

What is the Real Frontier?

LLMs’ biggest praise and harshest critique come from the same realization: they are great pattern matchers… and that’s it. In other words, they can extract patterns from data (all domestic cat breeds have slit-shaped eyes) and apply them to new data (if I see a new feline with slit-shaped eyes, I can infer with very high likelihood that it’s a domestic cat), but nothing beyond that.

But how superficial or deep are these patterns? Do they really understand what they ‘read’ or ‘see’?

While most people, upon seeing an LLM produce a reasoned-looking response, assume ‘LLMs do reason,’ plenty of researchers have risen to claim otherwise, and the proof is as surprising as it is overwhelming.

Problem 1: LLMs Are Really Bad Planners

Planning is an essential component of reasoning. It allows us to break a complex problem into simpler tasks that, combined, take us to the goal.

As we previously saw in this newsletter, Valmeekam et al. showed that while advanced models like o1-preview can build small plans, their performance drops rapidly to zero as we increase the number of steps.

To make matters worse, they are humiliated by Fast Downward, a classical planning system published in 2006 (revised in 2011) that uses brute-force search (trying different action sequences until it finds a solution). While this system scores 100%, o1-preview gets 37.3% on the hardest benchmark, with the former being faster and cheaper.
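
To make the comparison concrete, here is a minimal sketch of what search-based planning looks like in practice: a toy breadth-first search over a made-up Blocks World (the state encoding and action names are my own illustration, not Fast Downward’s actual machinery). The point is that the search systematically enumerates action sequences until it reaches the goal, which is why its success doesn’t degrade as plans get longer.

```python
from collections import deque

def bfs_plan(start, goal, actions):
    """Breadth-first search over states: returns the shortest action
    sequence that turns `start` into `goal`, or None if none exists."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for name, apply_fn in actions:
            nxt = apply_fn(state)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [name]))
    return None

# Toy Blocks World: a state is a tuple of stacks, e.g. (("A", "B"), ("C",), ()).
def move(src, dst):
    def apply_fn(state):
        stacks = [list(s) for s in state]
        if not stacks[src]:
            return None                      # nothing to move from this stack
        block = stacks[src].pop()
        stacks[dst].append(block)
        return tuple(tuple(s) for s in stacks)
    return (f"move top of stack {src} to stack {dst}", apply_fn)

actions = [move(i, j) for i in range(3) for j in range(3) if i != j]
start = (("C", "A", "B"), (), ())
goal = (("A", "B", "C"), (), ())
print(bfs_plan(start, goal, actions))  # prints a short list of moves
```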

But LLMs are underperforming not only an almost two-decade-old planner but also methods that are 50 years old.

Problem 2: They Aren’t Better than Decades-old Models & Can’t Follow Instructions

MIT researchers performed a series of tests for anomaly detection in time series data to analyze LLM performance against other methods. The results? ARIMA, a method developed in the 70s, outperformed them in seven of the eleven datasets.

While MIT’s piece was actually quite optimistic about LLMs, making the case that they outperformed other neural networks and highlighting that LLMs had not seen task-specific data (at least not as post-training fine-tuning) and still performed ‘okay-ish’, the undeniable truth is that, when going head to head with a 50-year-old technique, they lost.
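
For reference, this is roughly what the ‘decades-old’ baseline looks like. Below is a minimal sketch that fits an ARIMA model with statsmodels and flags points whose residuals are unusually large; the synthetic series, the (2, 0, 1) order, and the 3-sigma threshold are my own illustrative choices, not those of the MIT study.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily-cycle series with two injected anomalies.
rng = np.random.default_rng(0)
t = np.arange(500)
series = 10 + np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)
series[120] += 4.0   # injected spike
series[310] -= 4.0   # injected dip

# Fit ARIMA and score every point by its in-sample residual.
model = ARIMA(series, order=(2, 0, 1)).fit()
resid = model.resid
threshold = 3 * resid.std()
anomalies = np.where(np.abs(resid) > threshold)[0]

print("Flagged indices:", anomalies)  # indices 120 and 310 should be among them
```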

Another eye-opening piece of research assembled a series of questions in which implicit knowledge (prior experience with or knowledge of the task, which is usually the shortcut LLMs use to solve complex tasks) is useless.

Instead, the researchers provided all required instructions to solve the task in the prompt, similar to seeing a dishwasher manual for the first time but having all steps recorded for you to follow. In other words, they wanted to test LLMs in unfamiliar settings but with all required information available (we’ll see in a minute why this is crucial). Long story short, can LLMs perform simple instruction following?

As shown below, models offered middling performance on short instruction sets, but performance dropped considerably as the problems grew longer.
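
To make the setup more tangible, here is a hypothetical example of what such an evaluation item could look like (my own toy reconstruction, not the paper’s actual dataset): the prompt spells out every step of an arbitrary procedure, the ground truth is computed by literally executing those steps, and ‘longer problems’ simply means more steps to follow.

```python
import random

# A tiny vocabulary of unambiguous string operations the prompt spells out.
OPS = {
    "reverse the string": lambda s: s[::-1],
    "uppercase the string": lambda s: s.upper(),
    "drop the first character": lambda s: s[1:],
    "duplicate the last character": lambda s: s + s[-1] if s else s,
}

def make_item(word, n_steps, seed=0):
    """Build (prompt, expected_answer) for an n-step instruction-following task."""
    rng = random.Random(seed)
    steps = [rng.choice(list(OPS)) for _ in range(n_steps)]
    prompt = (
        f"Start with the string '{word}'. Apply these steps in order:\n"
        + "\n".join(f"{i + 1}. {step}" for i, step in enumerate(steps))
        + "\nReply with only the final string."
    )
    answer = word
    for step in steps:            # ground truth = literally following the manual
        answer = OPS[step](answer)
    return prompt, answer

prompt, answer = make_item("dishwasher", n_steps=6)
print(prompt)
print("Expected:", answer)
# Scoring is then an exact match between the model's reply and `answer`;
# the finding above is that accuracy falls as n_steps grows.
```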

If they are so smart, why is it so easy to make them fail at simple tasks? And if you think this is bad, I’m only getting started. The next piece of research, from Apple, really shows the cracks in the LLM narrative in a way we had not seen before.

Problem 3: They Are Heavily Token-Biased & Easily Fooled

One of the most popular benchmarks for testing LLMs’ prowess in math is OpenAI’s GSM8k, a series of grade-school-level math problems, a benchmark our so-called frontier LLMs have almost saturated.

Question/Answer GSM8k example. Source: HuggingFace

But Apple researchers asked: do LLMs truly understand what they are doing, or have they simply memorized sequences of words? To test this, they developed a side dataset of questions that introduced a series of modifications to the benchmark, including:

  • Varying numbers and names. Keeping the question otherwise identical, they changed the person’s name or a key number. Naturally, this minor change doesn’t make the problem harder or simpler; the output varies, but the process does not. If LLMs truly understand the problem, they should not care about irrelevant changes.

Source: Apple

  • They also added inconsequential clauses that seem context-relevant but are unnecessary for solving the task, evaluating whether the LLM can recognize this (a toy sketch of both perturbations follows below).

A seemingly relevant but inconsequential clause is added to fool the LLM.
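
As a rough illustration of both perturbations (my own toy template, not Apple’s actual generator), the sketch below takes a grade-school-style problem, swaps the name and the numbers while keeping the reasoning identical, and optionally appends an irrelevant clause. A model that truly understood the problem would be unaffected by any of these edits.

```python
import random

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{clause}How many apples does {name} have in total?"
)
IRRELEVANT = "Five of Tuesday's apples were slightly smaller than average. "

def make_variant(seed, add_clause=False):
    """Same underlying problem, different surface tokens (and optional distractor)."""
    rng = random.Random(seed)
    name = rng.choice(["Linda", "Bob", "Sofia", "Kenji"])
    a, b = rng.randint(3, 40), rng.randint(3, 40)
    question = TEMPLATE.format(
        name=name, a=a, b=b, clause=IRRELEVANT if add_clause else ""
    )
    return question, a + b   # the ground-truth answer is unchanged by the edits

for seed in range(2):
    q, ans = make_variant(seed, add_clause=(seed == 1))
    print(q, "->", ans)
```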

And the results are eye-opening.

On the one hand, LLMs demonstrate huge token bias, meaning the specific words used in the task matter a lot. Even with an otherwise identical problem, changing just one token can make the model fail. University of Pennsylvania researchers proved this even more extensively, showing that changing just the name of the person caused LLMs to fail despite the rest of the prompt being identical:

Why does Linda work but Bob doesn’t?

The example above is the famous conjunction fallacy, in which people judge a specific set of conditions to be more likely than a single general one, even though that can never be the case.
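
Formally, for any two statements A and B, the conjunction rule says:

P(A ∧ B) ≤ P(A)

In the classic framing, ‘Linda is a bank teller and active in the feminist movement’ can never be more probable than ‘Linda is a bank teller’ alone, yet people are tempted to judge the richer, more ‘representative’ description as the likelier one.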

The LLM gets it right when the name used is ‘Linda’ because that’s the name Kahneman and Tversky used in their work to illustrate this fallacy, which means LLMs have seen this problem many times in training with the name Linda. Thus, the failure to adapt to new names suggests that LLMs simply memorize the entire sequence instead of fully internalizing the fallacy.

However, larger models like o1-preview and o1-mini are less susceptible to these changes, so Apple levels up the challenge with the inconsequential clauses (unnecessary information added to see if it fools the LLM).

And boy, they do.

In that test, results collapse, with performance dropping by almost 70% in smaller models. OpenAI’s o1 models fall less but still degrade quite substantially. And that’s the only silver lining we can extract from these tests: LLMs do not reason, but scaling them to larger sizes seems to help mitigate the issue.

But if you think Apple has the harshest take on LLMs, brace yourself. Ironically, the harshest critics sit inside some of the companies investing most heavily in LLMs, and one of them even proposes a complete reframing of the AI industry.

A complete turnaround.

Yann & Chollet: LLMs Ain’t It.

Yann LeCun is the Chief AI Scientist at Meta, the company behind the Llama LLMs. Despite this, he is a renowned LLM skeptic, claiming they are “dumber than cats.”

He considers them utterly incapable of reasoning, planning, or basically anything remotely related to intelligence, even suggesting an entirely different architecture, JEPAs, that Apple casually adopted for one of its open-source models we covered.

More measured in his judgments is François Chollet, a legendary researcher at Google, who nonetheless agrees with LeCun on most points, making the bold argument that investment in LLMs is actually delaying AGI, and claiming that LLMs lack all three of the key capabilities required to build it:

  1. They can’t perform higher-level abstractions (finding the key regularities in data), as the clear token bias we just saw proves.

  2. They aren’t capable of acquiring new skills on the fly, meaning that no matter how simple a task is, LLMs can’t solve it if they haven’t seen it before.

  3. They are extremely sample-inefficient. If we define intelligence as acquiring new skills with little data (just a few examples to learn the pattern), LLMs aren’t more intelligent than crustaceans. Even your dog will learn new skills with millions of times fewer data samples than an LLM would require.

But I know what you’re thinking. If LLMs are really dumber than cats, why are they capable of the awe-inspiring things we see on X and other social networks daily? The reason is that you are framing their intelligence in the wrong way.

The Task Familiarity vs Complexity Conundrum

Following Chollet’s own recommendation, we must evaluate LLMs based on task familiarity rather than task complexity.

Think of a 12-year-old student who knows how to read. You can make her learn any uber-complex mathematical theorem by heart and even teach her to solve PhD-level problems, as long as she practices enough times and the problems stay largely the same.

At the end of the day, the kid doesn’t really understand the problem but has simply memorized the process. Thus, as long as the conditions don’t vary, the kid can solve problems she doesn’t comprehend, regardless of their complexity.

Now, take that kid and, instead of allowing her to memorize, ask her to reason through an undergraduate-level problem she hasn’t seen. Even though it’s much simpler than the PhD problem she’s learned to solve by heart, she can’t solve it.

The reason is that, through memorization, one can simply ‘learn’ to execute any task regardless of complexity. Thus, if we feed LLMs many examples of very hard problems and evaluate them based on task complexity, then LLMs indeed appear to be ‘PhD-level intelligent.’

However, it takes only an elementary problem that is unfamiliar to them to see the cracks in the wall: the model will most likely fail.

According to his view, task performance does not prove intelligence; it’s the process by which we arrive at an intelligent response that determines whether intelligence is present or not.

Therefore, we are being tricked into believing LLMs are intelligent because most popular benchmarks evaluate ‘intelligence’ based on their outputs and not on the process, making it really hard for us to see whether that smart output was the product of rote memorization or true reasoning.

Following this rationale, to fully determine whether LLMs display intelligence, we must evaluate them in situations where core knowledge or past experiences aren’t retrievable; only then, upon succeeding in the task, can we claim the model is processing tasks in an ‘intelligent’ way.

And when researchers do that (as we’ve seen earlier), results are much less encouraging. A very popular benchmark conceived around this idea is one we have shown countless times in this newsletter: the ARC-AGI benchmark, which Chollet authored.

In this benchmark, our best LLM, o1-preview, gets just 21% (crucially, the same score as Claude 3.5 Sonnet, which doesn’t have inference-time search). Meanwhile, brute-force search, the same kind of method Fast Downward used to humiliate o1 models at planning, can take you to a 50% score with enough compute.
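
To clarify what brute-force search means in this context, here is a toy sketch (my own illustration, far simpler than real ARC solvers): enumerate short programs over a tiny set of grid operations and return the first one that reproduces every training example. The serious versions do exactly this over a much richer operation set, with vastly more compute.

```python
from itertools import product

# A tiny "DSL" of grid operations; grids are tuples of tuples of ints.
OPS = {
    "rotate90":  lambda g: tuple(zip(*g[::-1])),
    "flip_h":    lambda g: tuple(row[::-1] for row in g),
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: tuple(zip(*g)),
}

def brute_force_program(train_pairs, max_len=3):
    """Enumerate every op sequence up to max_len and return the first one
    that maps all training inputs to their outputs."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            def run(grid, program=program):
                for op in program:
                    grid = OPS[op](grid)
                return grid
            if all(run(x) == y for x, y in train_pairs):
                return program
    return None

# One training pair whose hidden rule is "rotate the grid 90 degrees clockwise".
inp = ((1, 2), (3, 4))
out = ((3, 1), (4, 2))
print(brute_force_program([(inp, out)]))   # -> ('rotate90',)
```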

All things considered, things aren’t looking great for LLMs. So, is all hope lost, or are there ways to cure LLMs of this illness?

Is OpenAI an inflated $157 billion bubble?

And the answer to this question, with trillions of dollars at stake, might lie in the work of anonymous researchers and an under-the-radar Google DeepMind paper.
