Winners & Losers in the age of Augmented LLMs

🏝 TheTechOasis 🏝

In AI, learning is winning.

This post is part of the Premium Subscription, in which we analyze technology trends, companies, products, and markets to answer AI's most pressing questions.

You will also find this deep-dive in TheWhiteBox ‘Key Future Trends in AI’, where you can come back to it for future reference and ask me any questions you have.

Based on recent developments in academia, I finally have the conviction to draw a clear picture of how the AI industry will develop, including winners and losers.

You are reading this here before anyone else: We are entering the era of augmented AI.

In fact, based on this insight, I can confidently predict how the war on demand (who will eventually win the customers) will play out in every cohort, from enterprises to everyday users.

What’s more, I am also bringing to you the high-level Generative AI Demand Framework.

With this framework, you will be able to identify what models to use in each case with confidence that you are making the right choice, especially if there’s money on the line, based on what cutting-edge research indicates.

Finally, we will analyze which type of AI model (& company) will surf this wave and, importantly, those most likely to drown.

When Your Problem is a Feature, not a Bug

The biggest problem in Generative AI models is inaccuracies, sometimes referred to, although wrongly, as hallucinations.

In simple terms, the models eloquently and convincingly respond to many questions with inaccurate statements, which can lead to embarrassing moments for companies like Google.

But here’s the thing: the model hasn’t decided to generate wrong answers, nor did the data it was trained on necessarily say those things… it’s just working as intended.

Understanding How GenAI Works

As you probably know, LLMs provide a reasonable next word to a given sequence.

But here’s the thing: normally, they do not output the single most statistically probable next word, but one of the ‘top-k’ most probable ones.

And how do they choose which one is output? Largely at random: the model samples among those candidates, usually weighted by their probabilities.

Looking at the image above, all five words are reasonable continuations, but only one of them (playground) is the actual ground truth. Still, by design, the model samples one of the five at random when generating, which can lead to inaccuracies.
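To make the sampling step concrete, here is a minimal sketch of top-k sampling in plain Python. The candidate words other than ‘playground’ and all of the scores are hypothetical, and real LLMs do this over vocabularies of tens of thousands of tokens, so treat it as an illustration rather than how any particular model is implemented.

```python
import math
import random


def top_k_sample(logits, k=5, temperature=1.0, rng=None):
    """Sample one token from the k highest-scoring candidates.

    `logits` maps each candidate token to an unnormalized score.
    """
    rng = rng or random.Random()
    # Keep only the k candidates with the highest scores.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax over the survivors; subtracting the max keeps exp() stable.
    max_score = top[0][1]
    weights = [math.exp((score - max_score) / temperature) for _, score in top]
    tokens = [token for token, _ in top]
    # Draw one token at random, weighted by its renormalized probability.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the next word (illustrative values only).
scores = {
    "playground": 2.1, "park": 1.9, "school": 1.7,
    "beach": 1.2, "store": 1.0, "moon": -3.0,
}
print(top_k_sample(scores, k=5))  # any of the five plausible words
print(top_k_sample(scores, k=1))  # playground
```

With k=1 the randomness disappears and the model always picks its top candidate; with k=5 the same prompt can produce five different continuations, only one of which is the ground truth.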

But, again, it is not because it’s trying to deceive you; it’s by design. Or let me put it more clearly: Every LLM response is a hallucination; only sometimes they get it right.

But why do this? This stochasticity (randomness) is added so that the model can generate similar ideas or statements differently, nurturing its linguistic prowess and creativity at the expense of accuracy.

Now, this isn’t a problem when you use ChatGPT as a fun toy or for occasional random conversations. But when money is on the line, these systems, in their current form, are useless, explaining why the Generative AI industry is massively underperforming and in a clear economic bubble.

And there will be no demand if these models get facts wrong, period. Thus, can we eliminate the “problem” when it’s actually on purpose?

Fighting Hallucinations

For quite some time, Retrieval Augmented Generation, or RAG, has been touted as the cure for all diseases, a method that would provide the best of both worlds.

In short, RAG pipelines connect the LLM to a ‘source of truth’ database, which can be a vector database or a knowledge graph, among other options. Then, the pipeline uses the prompt to retrieve semantically related data from this source and sends it to the LLM as relevant context.

Generated by author
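The retrieve-then-prompt flow can be sketched in a few lines. This is a toy version under loud assumptions: the ‘embedding’ is a simple word-count vector standing in for a real embedding model, the knowledge base entries are hypothetical, and the prompt template is my own, not one from any specific RAG product.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_n=1):
    """Return the top_n documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_n]


def build_prompt(query, documents, top_n=1):
    """Prepend the retrieved passages to the user question as context."""
    context = "\n".join(retrieve(query, documents, top_n))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


# Hypothetical 'source of truth' entries.
knowledge_base = [
    "Invoice 1042 was paid on 3 March.",
    "The refund policy allows returns within 30 days.",
    "Support is available on weekdays from 9 to 5.",
]
print(build_prompt("What is the refund policy?", knowledge_base))
```

The LLM itself is untouched here: all the pipeline does is decide which text lands in the prompt.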

The reason this works is that LLMs can use data they haven’t seen beforehand as context to provide a better response, a feature known as in-context learning.

But how does this work?

Mechanistically speaking, LLMs build ‘induction heads’ during training. These are circuits that can look back in a sequence and ‘copy/paste’ patterns. Therefore, the model can use the words provided by the augmented context and, quite literally, copy and paste them.

For instance, in the sentence “Harry Potter was attacked. Thus, Harry…”, the induction head pays attention to the previous instance of ‘Harry’, looks at the next token (Potter), and attends to it, inducing the model to predict “Potter” as the next token to complete the pattern.
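The ‘copy/paste’ behavior in that example can be mimicked with a few lines of Python. To be clear, this is a behavioral caricature: a real induction head is a learned attention circuit operating on vectors, not an explicit loop over strings, and the function name is my own.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (final) token.

    Returns None when the current token has not appeared before.
    """
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


sequence = ["Harry", "Potter", "was", "attacked", ".", "Thus", ",", "Harry"]
print(induction_predict(sequence))  # Potter
```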

Sadly, however, in-context learning seems to be a very poor form of ‘learning’, and RAG pipelines generally underdeliver by a wide margin and often require continuous prompt engineering (tuning the way we communicate with the LLM), making it even harder to justify the implementation costs.

But a new method has been picking up huge traction lately, and it’s all about returning to the roots.

Going Specific

One obvious way to reduce hallucinations is by forcing a model to memorize the data completely so that every response equals the original sequence seen in training.

This, known as overfitting, is despised by most in the AI industry, as it makes the model incapable of going beyond its training data (generalization).

However, a new method, MoME (“mommy”), allows you to train a model to memorize specific facts while retaining the generalization capabilities that make LLMs useful for many tasks.

Similarly to RAG, for a given user request, we use a mechanism to retrieve the experts who have memorized that fact and embed them into the LLM, inducing it to respond with the literal fact instead of ‘winging it’ with the stochastic response it usually gives back.

For a deep-dive into MoME, read the previous issue of this newsletter.

source: Lamini

What is the difference with RAG?

While RAG augments the context provided to the LLM in real time, MoME augments the LLM itself, actively modifying the model’s answer distribution.

That said, a while back, contextual.ai proposed RAG 2.0 where the entire pipeline is fine-tuned, too.

According to the authors of MoME, we can almost eliminate hallucinations without impacting the LLM’s true nature, which emerges from any non-memorized fact.

Visually, the loss curve of MoME models is shown in pink below, where loss measures how far off the model is on every single next-word prediction (lower is better), dropping to zero on memorized facts.

Compared to a very good model like LLaMa 3, which specializes in minimizing the average loss per prediction, MoME preserves this low average while dropping the loss to zero for specific facts, eliminating hallucinations on those facts.
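A quick numerical sketch shows why ‘zero loss’ means ‘memorized’. I am assuming the standard cross-entropy loss used to train LLMs, and the probabilities below are hypothetical, not numbers from the MoME authors.

```python
import math


def token_loss(prob_of_correct_token):
    """Cross-entropy loss for one prediction: -log(p) of the correct next token."""
    return -math.log(prob_of_correct_token)


def average_loss(probs):
    """Mean per-token loss over a sequence of predictions."""
    return sum(token_loss(p) for p in probs) / len(probs)


# Hypothetical probabilities the model assigns to the correct next token.
general_text = [0.6, 0.4, 0.7, 0.5]  # ordinary text: plausible, never certain
memorized_fact = [1.0, 1.0, 1.0]     # a fact memorized exactly

print(round(average_loss(general_text), 3))  # positive: some uncertainty remains
print(average_loss(memorized_fact))          # 0.0
```

A low average loss still leaves room for the model to put real probability on wrong tokens; a loss of exactly zero on a fact means the correct token gets all of it.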

And here’s the key point: with MoME and task-augmented models (more on that below), we are entering the age of fine-tuned augmentation.

But what do I mean by that?

Augmentation is the Future of AI

LLM fine-tuned augmentation refers to post-training techniques in which a pre-trained LLM is augmented (grown in size) with small components, usually called adapters.

By only training these small adapters during fine-tuning, we aim to improve the model's performance in specific tasks without affecting its prowess in other tasks.

Let me explain.

Fine-tuning comes at a cost, or does it?

LLMs are generative models. Duh, right? Sure, but this statement carries much more weight than meets the eye.

A generative model is trained to find the parameters (weights) that maximize the likelihood of observing the training data through the model’s generations.

In layman’s terms, generative models are designed to generate data similar to their training data.

This means that the distribution of the training data (what information it has, how represented each concept or event is, etc.) plays a crucial role in the model’s outcome; the model becomes the “embodiment” of its training data.

But why am I telling you this? While naive fine-tuning (retraining the entire model) will improve performance on the task it is being fine-tuned for, this comes at the expense of the model losing performance in any task underrepresented in the fine-tuning dataset.

This is the reason why models like LLaMa aren’t really ‘open-source’, but ‘open-weights’. As Meta doesn’t publish the pre-training dataset, we don’t know what data the model has seen or how well represented each concept is.

Thus, every single LLaMa fine-tune (like the popular Vicuna) is unequivocally ‘better at something at the expense of being worse at something else we don’t know’.

Well, augmentation is the solution to that problem.

And what better proof that augmented AI is the ‘real deal’ than having one of the largest corporations in the world, Apple, embrace it at the core of its AI strategy and brand-new AI platform, Apple Intelligence?

Apple’s market value has grown by upwards of $400 billion since the announcement of its augmented LLM platform. Augmented AI is already generating billions in shareholder value and most of the world isn’t even aware of it.

The Adapter Madness

Apple’s new AI platform is the definition of augmentation. As shown below, Apple Intelligence will have single LLMs (both on-device and in the cloud) augmented with adapters specialized in specific tasks.

Source: Apple

Basically, LLM augmentation allows Apple to pack 10+ models into the size of one. This works because, with augmentation, only the adapters (which are tiny in comparison to the base model) are trained during each fine-tuning run, leaving the base LLM untouched.

Therefore, whenever a user asks for a particular task, we augment the LLM with the adapter for that task, execute it, and then unload the adapter in anticipation of the next task.

Amazingly, Predibase proved that you can even run 100 different adapters on a single LLM and a single GPU, in effect serving 100 different LLMs from one instance.
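Here is a minimal sketch of the idea for a single layer, assuming LoRA-style low-rank adapters (the base output plus a small update B(Ax)). The class, the task names, and the tiny matrices are all hypothetical; this is neither Apple’s nor Predibase’s actual code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AugmentedLinear:
    """One frozen base layer plus swappable low-rank (LoRA-style) adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # frozen: never modified by fine-tuning
        self.adapters = {}              # task name -> (A, B, scale)

    def add_adapter(self, task, A, B, scale=1.0):
        # A has shape (r, d_in) and B has shape (d_out, r), with r much smaller than d.
        self.adapters[task] = (A, B, scale)

    def forward(self, x, task=None):
        out = matvec(self.base_weight, x)
        if task is not None:
            A, B, scale = self.adapters[task]
            delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
            out = [o + scale * d for o, d in zip(out, delta)]
        return out


# Hypothetical 2x2 base layer with two rank-1 adapters.
layer = AugmentedLinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("summarize", A=[[1.0, 1.0]], B=[[0.5], [0.0]])
layer.add_adapter("proofread", A=[[0.0, 1.0]], B=[[0.0], [2.0]])

x = [1.0, 2.0]
print(layer.forward(x))                    # [1.0, 2.0]
print(layer.forward(x, task="summarize"))  # [2.5, 2.0]
print(layer.forward(x, task="proofread"))  # [1.0, 6.0]
```

The same base weights serve every request; only the small (A, B) pair changes per task, which is why many adapters can share one model in memory.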

Most enterprises aren’t deploying GenAI solutions simply because they aren’t aware of this, period.

At this point, all signs point in the same direction: fine-tuned augmentations are the secret sauce for GenAI adoption across end-consumers and enterprises.

A Key Difference

Apple Intelligence is ‘task augmentation’ while MoME is ‘memory augmentation’. One specializes the LLM to a given task, the other helps a model memorize facts and avoid hallucinations.

Crucially, both are based on LoRA adapters, which I explain in detail in my Apple Intelligence post.

But, importantly, how does this transform how companies and users will embrace AI?

The Demand Framework

The secret to a successful GenAI implementation is knowing what you are facing.

The Four Horses of Implementation

  1. First, we have the information retrievers.

These are models that need to excel at recalling accurate information. They are primed for knowledge-based use cases, where humans need direct and fast access to vast information with little to no margin for error.

Examples are the typical ‘chat with your data’ use cases, where we use LLMs as a direct, conversational interface to our data.

These aren’t exclusive to large corporations with terabytes of data to access; they are the quintessential use cases of AI today, with consumer-facing examples like Microsoft’s Recall, Apple Intelligence’s Semantic Index, or Google’s upcoming take on this feature.

  2. Next, we have the automators.

These LLMs are meant to automate processes, such as handling customer support, with examples like Klarna. Here, the error tolerance is still very low. Thus, we need models to be extremely accurate.

Also, these models must excel at function-calling, which allows them to call third-party services. APIs themselves rarely fail, but they are extremely sensitive to inaccurate inputs: a wrong parameter name or type is enough to break the call.

Luckily, as Salesforce recently proved with xLAM, function-calling can be made highly reliable as long as you fine-tune the model heavily for it.

  3. Moving on, we have the copilots.

Here, models must be “great at everything, perfect at nothing.” In this case, generalization is a more valuable asset than accuracy; we need models that can perform ‘just fine’ in hundreds of different tasks. Models here are used as copilots, human enhancers that help their users perform work faster.

Here, the error tolerance is much larger, as AIs are used for iterative creative processes like coding, writing, graphic design…

  4. And, finally, we have the coworkers.

These models act as literal human counterparts, capable of absorbing work from the three previous categories.
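To see why the automators category leaves so little room for error, consider what happens to a model-produced function call before it reaches an API. The sketch below is a hypothetical validator: the tool name, its parameters, and the JSON call format are assumptions of mine, not the format used by Klarna, Salesforce, or any specific vendor.

```python
import json

# Hypothetical tool schema: parameter name -> required Python type.
TOOLS = {
    "refund_order": {"order_id": str, "amount": float},
}


def validate_call(raw_json):
    """Parse a model-produced function call; return (call, None) or (None, reason)."""
    try:
        call = json.loads(raw_json)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(call, dict):
        return None, "call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return None, "unknown function"
    schema = TOOLS[name]
    args = call.get("arguments", {})
    if not isinstance(args, dict) or set(args) != set(schema):
        return None, "wrong argument names"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return None, f"argument '{key}' must be {expected.__name__}"
    return call, None


good = '{"name": "refund_order", "arguments": {"order_id": "A-17", "amount": 19.99}}'
bad = '{"name": "refund_order", "arguments": {"order_id": "A-17", "amount": "19.99"}}'
print(validate_call(good)[1])  # None
print(validate_call(bad)[1])   # argument 'amount' must be float
```

The second call is ‘almost right’ by human standards, yet it is rejected outright. A copilot can get away with almost right; an automator cannot.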

Now that we have clear categories:

  • Which models and companies are better positioned at each level?

  • Which are in real danger?

  • And what is the decision-making process you should be following?

Who Will Win, Open-source or Private?

Once you accept that fine-tuned augmentation will determine the sets of winning and losing models and strategies, everything falls into place rapidly.

Simply put, neither you nor your company will use a frontier AI model like ChatGPT to search your company’s database of the history of the Korean War; you will use a Korean War LLM.

But what do I mean by that?
