
The Platonic Representation & Apple Intelligence


Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

🚨 Week's Update 🚨

Welcome back! This week, we have news from China, Anthropic, Stability AI, Together.ai, OpenAI, and more.

A new video generation model from China is shocking the world. Named Kling, it reportedly matches Sora's performance while doubling the maximum video length, generating high-definition clips of up to 2 minutes.

Fueling this frenzy, San Francisco start-up Luma AI has released DreamMachine, another text-to-video model that looks very good. Is the industry getting bored of LLMs and ready for an exciting new race in video?

China's insane progress comes in handy when considering that the US is trying to kill its open-source industry with the SB-1047 bill. If US regulators think a bill in California is going to be followed by China (or Russia), they are out of their minds.

Also, considering that a large share of the world's AI talent already comes from China, this could be the beginning of the end of the US' AI supremacy.

Moving on, Anthropic released quite an interesting piece on how they approach the character development of their chatbot Claude.

This is the type of content I would never show to people starting off with AI, because it's extremely tempting to believe Claude has a character of its own if you read this.

It's a fascinating read, but please bear in mind we are talking about a model imitating the desired character, not developing it.

As for the open-source community, Stability AI seems desperate to disappear. They released the weights of Stable Diffusion 3 Medium, and the results are, frankly, embarrassing. Simply put, the model cannot draw humans, generating horrifying scenes.

Stability AI is extremely zealous in ensuring their datasets are clean and free of undesirable content, like porn. But filtering out that much data appears to leave their models considerably behind those of private players.

OpenAI announced its mechanistic interpretability efforts on GPT-4, claiming to have found around 16 million features that help us understand how these models see our world.

Interestingly, the method they used is highly similar to Anthropic's: sparse autoencoders that map neural activations into interpretable features.

Continuing with OpenAI, Elon Musk has dropped his March 2024 lawsuit against them, without explanation.

However, Elon Musk did slam Apple's announcement of its integration with OpenAI, even threatening to ban all Apple products at his companies.

Finally, as I know many of you read my newsletter because I tend to avoid the unnecessary hype, I want to leave you with an extract from an interview with François Chollet, creator of the highly influential Keras library, on why LLMs will not take us to AGI.

In fact, he argues LLMs are taking us away from that goal as OpenAI has forced the industry to close-source their innovations.

Nonetheless, François is putting his money where his mouth is, having announced a $1 million prize for whoever develops an AI that adapts to novelty and can reason.

šŸ§ You Should Pay Attention to šŸ§

  • Demystifying Apple Intelligence

  • Will Eventually All AIs Be the Same?

📲 How Does Apple Intelligence Work? 📲

As you probably know, Apple has finally unveiled its suite of GenAI capabilities for its product line-up, Apple Intelligence.

They come packed with multiple features, such as an Emoji generator, writing and imaging tools, and a much-demanded Siri update, all of which you can read in detail here.

But Apple has also released many details about the inner workings of these features, which we are unpacking today.

A Great Leap for the Edge

Apple is the first big company to position itself as 'AI at the edge'.

In other words, much of its efforts in developing Apple Intelligence have been focused on doing as much computation inside the device as possible to avoid privacy concerns.

And while thatā€™s commendable, the truth is that it has been only partially achieved.

A Family of On-Device and Server Models

Counterintuitively, Appleā€™s GenAI capabilities will depend on a wide range of AI models, not just one.

  • On the "on-device" side, a base model that will most probably be OpenELM, a 3-billion-parameter model they announced back in April 2024.

  • On the server side, a bigger, more capable model, probably Ferret-UI, another model announced back in April. These models will be stored in heavily protected Apple clouds, which they have called Private Cloud Compute. However, if the user's request is too complex, you will be offered the chance to send it to ChatGPT (GPT-4o).

For a deep dive on Ferret, I wrote an article on it back in April.

If you wish to understand the details of how they have protected their GenAI cloud, I'm afraid I'm not a cybersecurity expert, but this X thread goes into quite some detail if that's your thing.

But, hold on, how does the whole process work then?

The Process Wheel

For every request, an algorithm evaluates the most suitable model.

  1. Ideally, if the "on-device" models can handle it, they are the default choice.

  2. If not, the algorithm sees whether the Apple Server model can.

  3. Otherwise, it prompts you to send it to ChatGPT, which you can opt out of.
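The three-tier routing above can be sketched as a simple decision function. This is purely illustrative: the predicate names and tier labels are my own invention, not Apple's actual API.

```python
def route(request, on_device_ok, server_ok, user_allows_chatgpt):
    """Pick an execution tier for a request (illustrative sketch only).

    Each argument after `request` is a predicate deciding whether that
    tier can (or may) handle the request.
    """
    if on_device_ok(request):          # 1. default: stay on the device
        return "on-device model"
    if server_ok(request):             # 2. escalate to Apple's servers
        return "Private Cloud Compute"
    if user_allows_chatgpt(request):   # 3. ask the user before ChatGPT
        return "ChatGPT"
    return "declined"                  # user opted out of ChatGPT

# Example: a request too complex for the device but fine for the server
tier = route("summarize my inbox",
             on_device_ok=lambda r: False,
             server_ok=lambda r: True,
             user_allows_chatgpt=lambda r: True)
print(tier)  # Private Cloud Compute
```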

But we still haven't answered the question: how does Apple Intelligence work?

Every Trick in the Book

When running AI on the edge, that is, AIs fully loaded into your device and not connected to the cloud, you have to deal with one reality today: settling for less powerful models.

LLMs are Large, It's in the Name!

In AI, the name of the game is size, so all frontier models (ChatGPT, Claude, and Gemini) are believed to comfortably exceed the 1.5-trillion-parameter mark.

For reference, the RAM required to store a model of that size sits around 3 TB: approximately 1 TB for every 500 billion parameters at bfloat16 precision (2 bytes per parameter).

That's almost 38 state-of-the-art GPUs (NVIDIA H100s, 80 GB each).

Consequently, we can almost guarantee the model has, at most, just 3 billion parameters, making it completely unsuitable for complex tasks.

But even then, a 3-billion-parameter model still occupies 6 GB at that precision, leaving just 2 GB for every other process running on your smartphone, and that's before counting the LLM's KV Cache.

Therefore, Apple needed a couple of optimization tricks to make this work.

Let's Quantize and Fine-tune, Baby

For starters, after the base model was trained, it was quantized. In other words, instead of each parameter occupying 2 bytes in memory, they dropped that value to an average of 3.5 bits.

This brings the RAM requirements from 6GB to around 1.5GB, leaving plenty of space for other programs.
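All of these memory figures follow from the same back-of-the-envelope arithmetic (weights only, ignoring the KV cache and runtime overhead):

```python
def model_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone."""
    return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

# 3B parameters at bfloat16 (16 bits) -> ~6 GB
print(model_memory_gb(3e9, 16))     # 6.0

# Quantized to an average of 3.5 bits per parameter -> ~1.3 GB
print(model_memory_gb(3e9, 3.5))    # 1.3125

# A 1.5T-parameter frontier model at bfloat16 -> ~3,000 GB = 3 TB
print(model_memory_gb(1.5e12, 16))  # 3000.0
```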

But, again, there's no free lunch: quantizing the weights degrades the model's performance (they lose precision), which is a problem considering that these models are already quite limited.

Given this, Apple asked: why use only one model when we can use many specialized models that offer better performance?

But hold on. We are already struggling with memory for one single model; how on Earth are we thinking about using many?

Here's the key: these 'other models' are all fine-tuned versions of the same base model using LoRA adapters.

Specifically, Apple decided to fine-tune the base model for each task, creating a family of task-specific models.

Considering our memory constraints, they used a clever technique called LoRA to make this feasible. To understand LoRA, you need to understand one thing: LLMs are intrinsically low-rank for any specific downstream task.

But what do I mean by that? Simply put, only a fraction of the model "matters" for any given task.

Therefore, when fine-tuning the base model for summarizing an email, instead of fine-tuning the entire base model, we fine-tune the weights that participate in that task.

This is LoRA in a nutshell.

Technically, we create auxiliary adapters to the base model. To do so, we decompose the base model's different weight matrices into two low-rank matrices and fine-tune those matrices instead of the full set of weights.

For instance, if we had a small network with a single 5-by-5 weight matrix (25 parameters), we could break it down into two rank-1 matrices: one column (5-by-1) and one row (1-by-5).

But why does this work?

The rank of a matrix is the number of linearly independent rows (or columns) it has. In other words, if the rank is 2, we can take just two rows of the matrix and combine them to generate all the rest.

In layman's terms, this means that, however large the matrix is, only 2 rows really provide unique information for any given task, which is equal to saying that the rest of the rows are redundant.

Thus, LoRA works by optimizing only the parameters that truly matter and leaving the rest unchanged.

In the example above, this effectively reduces the number of parameters to fine-tune from 25 to 10. Then, once this adapter is trained, we can add it to the untouched base model, converting it into a fine-tuned model on demand.

A 15-parameter saving won't seem like a lot, but at the billion-parameter scale these models play in, LoRA adapters can reduce the number of actively trained parameters by 98% or more, a saving in the hundreds of millions of parameters.

But if we are training 'separate matrices', how do we actually alter the model's behavior?

To do so, we recompose the decomposed matrices by multiplying them. This outputs a matrix of the same size as the original base model matrix (the adapter), which can be added to the base model to create the fine-tuned model.

In the example we saw earlier, if you multiply a 5-by-1 vector by a 1-by-5 vector, you obtain a 5-by-5 matrix of the same size as the original.
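A minimal numpy sketch of this idea, using the 5-by-5 (25-parameter) toy example from above (the random values are just placeholders for real trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                              # toy 5x5 weight matrix: 25 parameters
W = rng.standard_normal((d, d))    # frozen base-model weights
r = 1                              # LoRA rank

# Trainable low-rank factors: one 5x1 column and one 1x5 row -> 10 params
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))

# Recompose by multiplying: a 5x5 update of the same size as W
delta = B @ A

# Base model + adapter = fine-tuned model, on demand
W_finetuned = W + delta

print(B.size + A.size)                   # 10 trainable parameters vs 25
print(np.linalg.matrix_rank(delta))      # 1 (the adapter is low-rank)
```

During fine-tuning only `B` and `A` receive gradients; `W` never changes, which is why one base model can serve many task-specific adapters.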

So, what does this all mean to Apple?

Dynamic Model Execution

Easy: on-demand dynamic model adaptation. For instance, let's say the user asks for a summary of their last 10 emails.

  1. Apple Intelligence's algorithm acknowledges this is a task an "on-device" model can handle, identifies the right adapter, and loads it into the GPU alongside the base model.

  2. The fine-tuned model (base model + adapter) executes the command.

  3. The adapter gets offloaded so that another adapter can be brought in.

Although not confirmed, I assume Apple uses some sort of tiered cache, keeping 'recently used' adapters cached for quick loading in case they are needed again.
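To make that speculation concrete, such a cache might behave like a tiny LRU (least-recently-used) store. Again: this is my guess at a plausible design, not a confirmed Apple implementation, and all names here are hypothetical.

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache for task-specific LoRA adapters (speculative design)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, task, load_fn):
        if task in self._cache:
            self._cache.move_to_end(task)    # mark as recently used
            return self._cache[task]         # cache hit: no disk load
        adapter = load_fn(task)              # cache miss: load from storage
        self._cache[task] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return adapter

# Usage: the third call is a cache hit, so only two loads happen
cache = AdapterCache(capacity=2)
loads = []
load = lambda t: loads.append(t) or f"adapter:{t}"
cache.get("summarize", load)
cache.get("proofread", load)
cache.get("summarize", load)
print(loads)  # ['summarize', 'proofread']
```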

This way, Apple can run a super-efficient family of powerful LLMs in one single device, offloading to servers what it canā€™t solve, but always guaranteeing adept performance and efficient memory usage.

TheWhiteBox's take

In summary, Apple Intelligence is a really ingenious way of running multiple "on-device", cloud-based, and even state-of-the-art (ChatGPT) models in one autonomous and intelligent LLM stack.

Additionally, we have covered how they have done this: combining some of AI's most resourceful and frugal techniques to give users the most value for their buck without making them surrender their privacy.

On a final note, it's commendable how Apple continues to support open source. It is probably the most open big tech company in AI, arguably surpassing Meta, as Apple's released models come with both open weights and open datasets.

🔥 Sponsored Content 🔥

These stories are presented thanks to beehiiv, an all-in-one newsletter suite built by the early Morning Brew team.

It comes fully equipped with built-in growth and monetization tools, a no-code website and newsletter builder, and best-in-class analytics that actually move the needle.

The top newsletters in the world are built on beehiiv, and yours can be too. It's the most affordable option in the market, and you can try it for free, no credit card required.

TheTechOasis has been running on beehiiv for more than a year now. It's really second to none when it comes to building newsletters.

šŸ›ļø The Platonic Representation šŸ›ļø

A few days ago, I came across a new theory combining AI and Philosophy: Are AIs all Becoming the Same Thing? In other words, will all future AI models be identical?

In fact, this could already be happening, as all models seem to be converging toward a 'platonic' representation: one unique way of understanding the world.

If true, the economic and philosophical repercussions would be enormous, as this could signal that AIs are, thanks to foundation models, starting to uncover, explain, and, crucially, predict reality itself.

Representations

To understand why researchers pose this question, we need to understand how models interpret the world.

And thatā€™s through representations.

For every single concept of the world they learn, AI models build a representation, a compressed way of describing it, capturing the key attributes.

In particular, AI models turn every concept into embeddings, a vector-form representation.

But why vectors? Well, if everything in the world becomes a vector, we get two things:

  1. Concepts in numerical form: machines can only interpret numbers, so everything needs to be expressed as numbers in some way.

  2. Similarity: By having everything as a vector, we can measure the distance between them.

Therefore, models build a high-dimensional space of world concepts that describes the model's understanding of the world, a space ruled by one principle: relatedness.

In this inner world, the more similar the two real-world concepts are, the closer their vectors are in this space.

In this space, 'dog' and 'cat' are closer than 'dog' and 'window', but also closer than 'dog' and 'pigeon', because 'dog' and 'cat' are more semantically related (non-aerial, four-legged, domestic, etc.).
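We can illustrate this relatedness principle with cosine similarity between embeddings. The 4-dimensional vectors below are invented for illustration; real models use embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings; dimensions loosely encode traits like
# (domestic, four-legged, aerial, man-made)
dog    = np.array([0.9, 0.8, 0.1, 0.1])
cat    = np.array([0.8, 0.9, 0.2, 0.1])
pigeon = np.array([0.2, 0.7, 0.9, 0.1])
window = np.array([0.1, 0.1, 0.2, 0.9])

# 'dog' is most similar to 'cat', then 'pigeon', then 'window'
print(round(cosine(dog, cat), 2))     # 0.99
print(round(cosine(dog, pigeon), 2))  # 0.6
print(round(cosine(dog, window), 2))  # 0.25
```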

But how do they build this space?

Using the 'dog' and 'cat' example: by seeing billions of texts, the model realizes that 'dog' and 'cat' are words used in similar contexts throughout human language. Thus, it concludes that they must be similar.

Fascinatingly, latent spaces (this is what these spaces are called) work both ways: not only do they help us teach AI models how world concepts relate, but they can also help uncover world patterns humans have not yet realized. For example:

  1. The Semantic Space Theory behind latent spaces is helping us discover new mappings of human emotion, as Hume.ai has proven.

  2. We are discovering new smells: research led by Osmo.ai is mapping the world's smells and interpolating between them to find new ones.

HumeAI's mapping of facial expressions

But how is this possible?

As it turns out, AI is better than humans at pattern matching, finding key patterns in data that were initially invisible to us or that we were simply too biased to acknowledge (AI can bridge the gap between Western culture and others; for example, people in non-Western countries express certain emotions differently).

Therefore, if AIs are truly capable of an 'unbiased' view of reality, are they capable of observing reality just as it is?

And the answer, researchers theorize, is yes.

The Allegory of the Cave

Can AIs learn the true nature of reality that causes every observation or event?

If that's the case, could we eventually mature our AI training so much that our foundation models understand reality just as it is and, thus, all evolve into the same model?

To validate this theory, researchers claimed the representations of these models should all converge into a single representation of the world, an objectively true and universal way of representing reality just as it is.

Think of representation building as an act of intelligence: the closer my representation of the world is to reality, the more I prove I understand it.

But how can we test if thatā€™s taking place? To test this, researchers compared several popular models' latent spaces.

If we recall, an AI model's representation of the world is a high-dimensional space (represented in three dimensions in the previous diagrams for simplicity) where similar things are closer together, and dissimilar concepts are pushed apart.

However, not only does the overall distribution of concept representations matter, but so do the relative distances between them.

In layman's terms, the representation one model assigns to the color 'red' should be similar not only across models of the same modality (comparing LLMs with LLMs); the way a language model and a vision model interpret (encode) 'red' should be similar too.

That way, we prove that both models are converging on the same concept of 'red'. This is what the theory of Platonic representation signifies: reality is not dependent on the eyes of the beholder; it's universal.
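One common way to quantify this kind of cross-model alignment is linear Centered Kernel Alignment (CKA), which compares how two models arrange the same set of concepts. Note this is an illustrative stand-in: the actual Platonic-representation work uses a mutual nearest-neighbour metric, and the data below is synthetic.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices whose rows
    describe the same n concepts. Returns a value in [0, 1]."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)

# Synthetic "true" structure shared by two models (100 concepts, 8 dims)
concepts = rng.standard_normal((100, 8))
model_a = concepts @ rng.standard_normal((8, 16))  # two different linear
model_b = concepts @ rng.standard_normal((8, 32))  # "views" of reality
noise   = rng.standard_normal((100, 16))           # an unrelated model

# Models that encode the same underlying structure align far more
# strongly than a model and random noise
print(linear_cka(model_a, model_b) > linear_cka(model_a, noise))  # True
```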

But why call it Platonic representation?

In particular, researchers reference Plato's Allegory of the Cave: the data we feed the models are the shadows, a vague representation of reality, and our previous AI systems are the people watching the shadows, meaning they have only a partial view of life.

But with scale and being forced to multitask, foundation models might transcend their data, eventually emerging from the cave to learn the true nature of reality.

One can make the argument that the training data is the 'shadow' because we humans are still, to this day, oblivious to the true nature of reality.

This would also serve as proof that our current methods of AI training, imitation learning on human data, can never reach AGI, let alone superintelligence, because humanity's own limitations chain our AIs to seeing the world in the biased way we do.

In this scenario, we would never build AGI until we develop a method for AIs to observe the world and learn from it, instead of through the proxy of human data.

Case in point, if our AIs can one day observe reality just as it is, independently of the modality (language models, image models, or video models), they should all have an identical definition of reality.

But what do we mean by that?

Visually, looking at the image below, f_image and f_text observe the same world concept, so their representations (their vectors) should be identical.

This is all great, but do we have any indications that this is taking place?

Convergence

Fascinatingly, foundation models seem to be converging.

When comparing a set of LLMs with a set of vision models, researchers found an almost 1:1 correlation between LLM performance and alignment to the vision models (how similar their inner representations are).

In layman's terms, the better our LLM performs, the more similar its representations become to those of the vision models, which is another way of saying:

The better LLMs become, the more their world understanding converges with that of vision models, signaling that as models get bigger, independently of their modality and training data, they all seem to be converging on the same representation, the same understanding of reality, and, thus, becoming smarter.

Comparing frontier models to the DINO vision model

The reason this may be happening is made obvious by the image below: as the model has to find a common way to solve more than one problem, the space of possible solutions to both tasks becomes smaller:

Consequently, the larger the model and the broader the set of skills it is trained for, the more these models tend to converge into each other, independently of the modality and datasets used, painting the possibility that, one day, all our frontier labs will converge into creating the same model.

TheWhiteBox's take:

When AI meets philosophy.

To me, the most critical point of interest, besides the fact that this could have tremendous commoditizing effects on the market, is how this could be the first sign that we humans are approaching the creation of world models, AI models that can observe and predict the world.

World models are essential to almost any AI field. They are thought to be, just like humans (we also build world models in our brain), the key to teaching machines common sense, the capability that would allow them to survive and, eventually, thrive in our world through embodied AI systems.

šŸ§ Closing Thoughts šŸ§

This week, we've seen two sides of the coin. On the one hand, Apple has shown the world that Generative AI is ripe to become a tool for everyday use.

GenAI needs to become truly useful soon, and Apple may again be the company that makes this happen.

On the other hand, with the next frontier of AI models training as we speak, society needs to start asking itself: what are we really building?

Bear in mind that before the end of the year, you will witness the arrival of a new set of ultra-powerful models that will embarrass our current state of the art.

Yet, we humans still canā€™t explain how these soon-to-be-obsolete models work, let alone the ones just around the corner.

And thatā€™s a problem.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]