Mixture-of-Depths, Unveiling the Future of Siri, 🦙LLaMa 3, & more

Ignacio de Gregorio Noblejas
April 11, 2024

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

AI News of the Week
AI Research of the Week:
- Google presents Mixture-of-Depths, and it looks amazing
- Apple ReALM and Ferret-UI, the Future of Siri?
Leaders: Here’s why AI Will be Decentralized

🚨 Week’s Update 🚨

Welcome back! This week is packed with very important news.

Starting off, Meta has announced that they are soon releasing LLaMa 3, the next generation of its family of LLMs.

I suspect they will be multimodal, at least for images. And considering how Meta is consistently pushing forward the state-of-the-art for audio and speech, with examples like Seamless M4T, I would not rule out the possibility of seeing the first audio/speech-native Large Language Model.

ChatGPT’s GPT-4V LLM is not audio native; it just calls Whisper for speech/audio transcription and a TTS model for the opposite direction.

Speaking of our friends at OpenAI, GPT-4 Turbo with Vision is now generally available to developers through the API, so builders can now handle image and text processing from one single API (as it’s the same model).

Moving on to hardware, some really spicy news came from Phoenix. Intel presented its new accelerator chip, the Gaudi 3, which, according to Intel, beats NVIDIA’s H100 chips by a landslide, especially regarding training speed and power efficiency.

This isn’t that surprising as Stability AI already pointed out recently how they were obtaining better results with Intel’s previous chips, the Gaudi 2, over the H100, so this margin should increase more with the new version. That being said, NVIDIA has yet to release its new accelerator, Blackwell.

Fun fact, Intel is 13 times less valuable on the stock market compared to NVIDIA. Thus, is Intel massively undervalued, or is NVIDIA outrageously overvalued?

On the inference side, we still need to pay attention to Groq, as their hardware, the Language Processing Unit (LPU), appears to be outright superior to any GPU in that regard. You can try Groq’s hardware here for free.

However, some have been arguing for some time that the return of analog systems is inevitable. I highly recommend this video if interested.

Continuing on hardware, in a more futuristic fashion, it seems Sam Altman is teaming up with Jony Ive, legendary product designer of the iPhone, iPad, and others, to build a personal AI device that ‘won’t be similar to an iPhone’, and they are seeking up to $1 billion in funding.

Hard to imagine what they are referring to, but the closest thing we have to an AI personal device is Rabbit’s R1, which relies entirely on voice interaction to execute the tasks you require.

Fun fact, Rabbit’s AI system is neurosymbolic. Unlike standard LLMs, it combines neural networks with high-level human-coded programming to increase performance and decrease training times and costs.

I want to end this week’s update with an interesting piece by The New York Times on how most AI labs are desperately—and illegally—cutting corners to amass enough data for training.

The NYT is in lawsuits with OpenAI precisely for this matter, so assume certain bias.

🪼 Mixture-of-Depths 🪼

Rarely have I been more excited about a research paper. Presented by Google DeepMind, Mixture-of-Depths, or MoD from now on, simply makes sense.

Its principle?

Not all tasks are made equal. In other words, MoD models can dynamically allocate compute to each prediction, just like a human would.

If proven at scale, MoDs would not only drastically reduce the requirements to run models, but would also open the door to smarter and more powerful models.

And the best thing? This can be applied to every single LLM in the world.

Not all Thoughts are Created Equal

Humans, when faced with a task, can decide how much thought, or effort, they will dedicate to the problem.

While some issues can be resolved quickly, even unconsciously, other tasks need your complete attention and focus.

In short, humans allocate compute to a task depending on its predicted complexity.

But do AI models do the same? Hard no. Our current models dedicate the exact same compute to every single prediction, no matter how hard (or easy) it is.

This raises the question: if models allocate the same compute to simple tasks as to hard ones, could it be that AI models, specifically Transformers, consume more compute than they actually need? And can we make them smarter?

This is precisely what MoD tackles.

Letting Models Decide

To understand how MoD models work, we first need to clarify how Transformer LLMs work.

We are focusing on Transformer-only LLMs, as the paper only focuses on those. But you could potentially apply MoDs to other architectures like Mamba, or Hyena.

Everyone Gets Attended

Transformer models, with prominent examples like ChatGPT, Gemini, or Claude, are sequence-to-sequence models that receive an input sequence, like text, and output another sequence, usually the continuation of that text.

They are made of a stack of ‘Transformer blocks’, as explained last week. These blocks have two distinct layers:

A multi-head attention layer, to compute the attention mechanism
A Feedforward Layer (FFN), to improve feature extraction

And that’s basically it.

You stack these blocks one after the other, and in the last layer of the model, you classify all the possible words that can be outputted by the model and choose one of the top-k most reasonable continuations to the sequence. Read this full in-depth explanation of attention here.
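To make the "stack of blocks" idea concrete, here is a minimal sketch in plain Python. It is a toy under loud assumptions: a single attention head instead of multi-head, no layer normalization, random untrained weights, and made-up names (`causal_attention`, `ffn`, `block`). It is only meant to show the shape of the computation, not how any production model is written.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, Wq, Wk, Wv):
    """Single-head attention: each token mixes in information from itself and earlier tokens."""
    d = len(X[0])
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    out = []
    for i in range(len(X)):
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(d)])
    return out

def ffn(x, W1, W2):
    """Feedforward layer applied to one token: expand, ReLU, project back."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def block(X, p):
    """One Transformer block: attention, then FFN, each added back via a residual connection."""
    A = causal_attention(X, p["Wq"], p["Wk"], p["Wv"])
    X = [[xi + ai for xi, ai in zip(x, a)] for x, a in zip(X, A)]
    return [[xi + fi for xi, fi in zip(x, ffn(x, p["W1"], p["W2"]))] for x in X]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_block(d, rng):
    return {"Wq": random_matrix(d, d, rng), "Wk": random_matrix(d, d, rng),
            "Wv": random_matrix(d, d, rng),
            "W1": random_matrix(4 * d, d, rng), "W2": random_matrix(d, 4 * d, rng)}

rng = random.Random(0)
d = 4                                                    # toy embedding size
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
hidden = tokens
for params in [init_block(d, rng) for _ in range(3)]:    # depth = 3 blocks
    hidden = block(hidden, params)
```

Notice that every token goes through every block. That is exactly the part MoD questions.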

But why do we need many blocks? The more blocks you have, aka the larger the model’s depth is, the more nuances it can capture on the relationships between words.

Overall, the essence of the Transformer is that over this concatenation of blocks, the different words in the sequence are updated with information provided by other words.

If I ask you what the word ‘bat’ means, you will answer that ‘it depends on the context’.

Well, that’s precisely what attention does: it updates the value of every word in a sequence so that it accounts for its surrounding context, letting the model tell whether ‘bat’ refers to an animal or a baseball bat.

But here’s the thing.

With standard attention, every word in the sequence gets to attend to every single other word, which is very expensive and, maybe… totally unnecessary?

And here is where MoD comes in.

A Routing Problem

In simple terms, what MoD does is, for every word in a sequence and every block in the model, decide if that token gets updated or not.

‘Updated’ means that the word goes through the block’s computation, that is, the attention process we described earlier followed by the feedforward layer.

As shown below, before a Transformer block, every token—word in the sequence—is inserted into a ‘router’, which assigns a weight.

This weight determines how important that word is to that block, which is akin to saying “This word should attend to other words or be attended to”.

The routing is not stochastic, aka random. Instead, the parameters of the router are learned as part of the training, so that the router ‘learns’ to make such decisions.

As shown below, what this means is that, depending on the current word prediction, the model gets to decide what previous words in the sequence are relevant to predict the new word.

Source: Google

Let’s say we have the sequence “Jane is French, born in the capital, and likes ice cream. Thus, Jane was born in…“

For the next-word prediction, the model has to decide to what previous words the last predicted word, ‘in’, should pay attention.

For instance, should ‘in’ pay attention to “and likes ice cream”? Do you think that part of the sequence is relevant to the model’s capacity to output ‘Paris’ as the next word?

In each case, for this precise prediction:

A standard Transformer would attend to all previous words
On the other hand, a smarter MoD Transformer would decide what tokens are irrelevant and, thus, are routed to the residual connections (depicted as empty arrows at the sides in the picture above), skipping all computation. In layman’s terms, the fact that she likes ice cream is irrelevant and thus those words are ignored for that precise prediction.

See the point? MoD models are ‘aware’ of the value of each word in a sequence in order to predict the next one!

An important note, a word not being chosen doesn’t mean it can’t be chosen in the future; it just depends on the current prediction.
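Here is a minimal sketch of the routing idea described in this section, in plain Python. The assumptions are mine: the router is a single learned vector (`router_w`) whose dot product with a token gives its weight, the top-scoring fraction of tokens (`capacity`) is processed, the update is scaled by the router weight, and `block_fn` is a toy per-token stand-in for the real attention + FFN computation. Treat it as an illustration of "process some tokens, send the rest through the residual path", not as the paper's implementation.

```python
def mod_block(X, router_w, block_fn, capacity):
    """Run block_fn only on the highest-scoring tokens; the rest pass through unchanged."""
    # Router: one scalar weight per token.
    scores = [sum(w * xi for w, xi in zip(router_w, x)) for x in X]
    k = max(1, int(len(X) * capacity))            # how many tokens this block may process
    ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])

    out = []
    for i, x in enumerate(X):
        if i in chosen:
            update = block_fn(x)
            # Residual connection plus the block's update, scaled by the router weight.
            out.append([xi + scores[i] * ui for xi, ui in zip(x, update)])
        else:
            out.append(list(x))                   # residual path only: no computation
    return out, sorted(chosen)

# Toy data: 8 tokens with 2 features; the router only looks at the first feature.
X = [[0.1, 1.0], [0.9, 1.0], [0.3, 1.0], [0.8, 1.0],
     [0.2, 1.0], [0.0, 1.0], [0.5, 1.0], [0.4, 1.0]]
router_w = [1.0, 0.0]                             # hypothetical "learned" router weights
out, chosen = mod_block(X, router_w, lambda x: [2 * v for v in x], capacity=0.25)
print(chosen)                                     # indices of the tokens that were processed
```

With a capacity of 25%, only 2 of the 8 tokens are processed by this block. In a real model the router weights are trained together with everything else, and a token skipped in one block can still be picked in the next.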

An Exciting Discovery

Overall, MoD’s results are incredible. For a fixed compute comparison between MoDs and standard Transformers, MoDs not only are much more efficient, but they are also smarter.

With an 87.5% capacity reduction, meaning that in a block that uses routing roughly 7 out of 8 words skip the computation, the model was still competitive with standard models.

This means that, despite skipping the large majority of the per-token computation in those blocks, the model was still on par with the 100% compute allocation model! And not only that: for the same compute, MoDs actually perform better than the baseline.
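To get a feel for what an 87.5% capacity reduction means, here is a back-of-the-envelope calculation. The sequence length of 2,048 is a made-up example, and the cost model is a rough assumption of mine: feedforward cost grows linearly with the number of tokens processed, and attention cost grows quadratically. It ignores the router's own cost and any blocks that process all tokens.

```python
def routed_block_cost(seq_len, capacity):
    """Rough relative cost of one routed block versus a block that processes every token."""
    k = int(seq_len * capacity)                   # tokens that go through the block
    return {
        "tokens_processed": k,
        "tokens_skipped": seq_len - k,
        "ffn_fraction": k / seq_len,                          # linear in tokens
        "attention_fraction": (k * k) / (seq_len * seq_len),  # quadratic in tokens
    }

cost = routed_block_cost(seq_len=2048, capacity=0.125)
print(cost)
# {'tokens_processed': 256, 'tokens_skipped': 1792,
#  'ffn_fraction': 0.125, 'attention_fraction': 0.015625}
```

Under these assumptions, the attention part of a routed block shrinks far more than the headline 87.5% suggests, since only the selected tokens attend to each other.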

One last thing to mention is their relationship with Mixture-of-Experts (MoE) models.

MoE is another form of conditional computing that, instead of choosing what tokens participate in the prediction, breaks down the model into ‘experts’ so that each part of the model becomes more specialized and more efficient to run. Mixtral and Gemini 1.5 are confirmed MoEs, while GPT-4 and Claude 3 are reportedly MoEs too, although neither has been officially confirmed.

But does that mean we have to choose between one or the other? Luckily, no.

While MoEs work at the width dimension, MoD works at the depth dimension, where a model has a fixed compute budget and has to be clever about what tokens (words in LLMs’ case) to engage for each prediction.

In other words, they can be combined. In fact, MoD’s researchers tried it, with a highly promising result: for a given compute budget, the model achieved a lower loss when combining both methods.

Source: Google
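For contrast with MoD, here is a toy sketch of the MoE side, again in plain Python. The simplifications are mine: each token is sent to a single expert (top-1), the gate is a small matrix of hypothetical weights (`gate_W`), and the ‘experts’ are trivial functions rather than feedforward networks. Real MoE layers typically route to more than one expert and blend their outputs.

```python
def moe_layer(X, gate_W, experts):
    """Width-wise routing: every token is processed, but each by only one expert."""
    out, picks = [], []
    for x in X:
        # One gate score per expert for this token.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_W]
        best = max(range(len(experts)), key=lambda j: logits[j])
        picks.append(best)
        out.append(experts[best](x))
    return out, picks

experts = [
    lambda x: [2 * v for v in x],     # toy expert 0
    lambda x: [-v for v in x],        # toy expert 1
]
gate_W = [[1.0, 0.0],                 # expert 0 prefers a large first feature
          [0.0, 1.0]]                 # expert 1 prefers a large second feature
moe_out, picks = moe_layer([[3.0, 1.0], [0.5, 2.0]], gate_W, experts)
print(picks)      # which expert handled each token
```

The difference in one line: MoE asks "which part of the layer should handle this token?", while MoD asks "should this token be handled by this layer at all?". Nothing prevents a model from asking both questions.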

What We Think

In my humble opinion, this is one of those research papers that could set a precedent.

From its results, it can be interpreted that Mixture-of-Depths, besides creating more efficient models, makes models become smarter overall by making them more ‘aware’ of allocating compute to what matters, akin to humans dedicating more thought effort to a hard problem to gain better results.

While it has to be proven at scale, we could soon see all models being MoDs.

🥇 This week on Leaders… 🥇

I’ll say it straight to your face, I can—and will—convince you that Crypto is not only necessary, but essential, to the future of AI.

I won’t try to sell you any stupid cryptocurrency, but I will prove to you that blockchain technology is a fundamental piece of the puzzle.

If you think I’m full of crap, I dare you to click the following button. Otherwise, you will remain completely oblivious to one of the most important opportunities of the decade.

🔮 Apple’s Huge Hint on Siri’s Future 🔮

Once again, Apple has released two AI research papers with important insights into their strategy regarding Generative AI.

Undoubtedly, among the Apple products most desperately in need of an upgrade is Siri, its (terrible) voice assistant.

Surprisingly, more than a year after ChatGPT’s release, Siri is still as dull and limited as ever, feeling almost prehistoric.

But with Apple’s ReALM and Ferret-UI models, we now have a much clearer idea of the future of Siri.

A Long-Awaited Issue to be Solved

At this point, many people will be wondering: “Why hasn’t Apple updated its voice assistant Siri?”

Naturally, they have plenty of reasons to ask, as companies like Google have already replaced their assistants with a Large Language Model (LLM), Gemini.

But unlike Google, Apple rarely makes big releases without being entirely sure that its product is excellent.

And the problem is that, truthfully, most LLMs today, even our best ones, are terrible voice assistants for two reasons: data and size.

The Missing Data

In AI, data >>> architecture.

No matter how good your architecture design is, without the appropriate data for a task, the model will fail miserably. And reference resolution data might be among the hardest to come by.

But what is reference resolution?

Reference resolution involves the process of determining what specific entities or items ambiguous references in language are pointing to, considering various types of context.

For example, in the image below, the user asks the voice assistant to display nearby pharmacies.

Once the assistant shows the list, the user might refer to elements in the list indirectly (as in the three examples shown below), requiring the assistant to identify what each reference points to.

Source: Apple

If you think about it, most of your interactions with Siri or any other voice assistant are precisely built this way, where you expect the assistant to be able to detect these relationships.

But you can make things even harder for Siri, with examples such as “Can you lower the volume?” which might refer to your iPhone’s sound, or the room speaker which also happens to be connected to the agent.

And Siri also has to deal with another issue, on-screen references, which means that it must handle not only textual references but also visual ones.

Overall, even though our current best models should be able to do all this, they lack the data to optimize for this task, performing horribly.

But there’s an added complexity.

Small Hardware Requires Small Models

If you're going to run Siri on an LLM, you need it to be very, very small.

Because the Transformer’s weights file (where its knowledge is stored) has to sit in RAM, even the most premium smartphones can’t run a medium-sized model: currently, the most RAM a smartphone offers is around 24GB.

This means that you are going to be able to run a 10GB LLM at most, and that’s very optimistic.
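The arithmetic behind that limit is simple enough to sketch. The 7-billion-parameter figure is the Ferret-UI size mentioned later in this piece. The 16-, 8- and 4-bit precisions are illustrative assumptions of mine, 1 GB is taken as 10^9 bytes, and the estimate covers the weights only, ignoring the memory needed for activations, the attention cache, and the rest of the phone's software.

```python
def weights_size_gb(num_params, bits_per_param):
    """Approximate size of a model's weights file in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9                                      # a 7-billion-parameter model
sizes = {bits: weights_size_gb(params, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits}-bit weights: {gb:.1f} GB")
# 16-bit weights: 14.0 GB
# 8-bit weights: 7.0 GB
# 4-bit weights: 3.5 GB
```

So even a 7B model only fits under the roughly 10GB ceiling if its weights are stored at reduced precision, which is why model size is such a hard constraint on phones.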

So what did Apple do?

Purpose-Built Data and a Specialist Model

Apple trained ReALM by:

Creating diverse datasets that included examples from conversations, synthetic scenarios, and actual screen content.
Adapting a large language model (FLAN-T5) by teaching it to understand these examples. They converted all the different entities and scenarios into a text format the model could learn from.
Innovating a way to describe what’s on a screen with text only, so the model could "visualize" screen layouts and entities as if it were seeing them, helping it decide what references like "the top one" or "the left one" mean.
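To illustrate the last point, here is one plausible way to turn a screen into text that a language model can read. This is my own guess at the general idea, not Apple's code: the entity names, pixel coordinates, and the `line_margin` threshold are all made up. Entities are read top to bottom, items at roughly the same height are placed on the same line from left to right, and lines are separated by newlines.

```python
def screen_to_text(entities, line_margin=10):
    """Render (label, x_center, y_center) entities as text that preserves rough layout."""
    rows = []                     # each row is a list of (x, label) pairs
    row_y = None
    for label, x, y in sorted(entities, key=lambda e: e[2]):   # top to bottom
        if row_y is None or abs(y - row_y) > line_margin:
            rows.append([])       # far enough down: start a new text line
            row_y = y
        rows[-1].append((x, label))
    # Within each row, order left to right and separate items with tabs.
    return "\n".join("\t".join(label for _, label in sorted(row)) for row in rows)

screen = [
    ("Call", 200, 102),
    ("CVS Pharmacy", 40, 100),
    ("Walgreens", 40, 160),
    ("Call", 200, 158),
]
text = screen_to_text(screen)
print(text)
# CVS Pharmacy	Call
# Walgreens	Call
```

With the layout encoded this way, "the top one" maps to the first line and "the left one" to the first item on a line, which is the kind of cue a text-only model can learn to use.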

Interestingly, ReALM's performance was impressively competitive with GPT-4, even though it is a much lighter model with orders of magnitude fewer parameters.

However, just three days ago Apple announced a multimodal LLM (MLLM) focused on screen understanding, named Ferret-UI. As shown below, this model can comprehend iPhone screens (and Android ones too) at any required granularity.

It completely obliterates any other MLLM in on-screen tasks, including GPT-4V.

Are these releases pure coincidence? I think not.

Surely, Siri’s upgrade will be a combination of both. Probably using Ferret-UI as the base, they will fine-tune it with the reference data used in ReALM to create a model that understands vague or subtle references perfectly while being able to interpret every single object and command portrayed on their screen.

That being said, I expect newer versions of this architecture before the upgrade for two reasons:

They used GPT-4 to generate data, meaning they can’t commercially use Ferret-UI
Ferret-UI is still at least 7 billion parameters in size, so unless they are considering storing LLMs in flash memory, they will need it to get smaller.

What We Think

With every new release, Apple’s strategy toward GenAI is very clear:

Everything they are building is focused on the iPhone’s next big update, iOS 18.

Unlike other big tech players, which are trying to create the next big AI leap, Apple seems committed to picking the low-hanging fruit left by their competitors and trying to innovate in the opposite direction:

Instead of making AI bigger and better, they are making it smaller and more efficient, which is crucial for the consumer-end use cases that are the essence of Apple’s business model.

📚 Sponsor of the Week: MIT 📚

AI: Balancing Risk and Return

Innovative technologies are revolutionizing business as we know it, and they’re more accessible than ever. But to truly harness the transformative potential of AI, you need to know how and when to use it. And which pitfalls to avoid.

The six-week Artificial Intelligence: Implications for Business Strategy online short course from MIT Sloan School of Management and MIT Computer Science and Artificial Intelligence Laboratory explores AI’s business applications and challenges.

Choose this program to:

Optimize your business: Leverage AI, ML, and robotics to drive efficiencies, improve productivity, and support your growth.
Develop a strategic roadmap: Apply your knowledge to effectively integrate AI into your business.
Gain a dual perspective: Benefit from a course designed by two prestigious schools — the MIT Sloan School of Management and the MIT CSAIL.
Conveniently build career-critical skills: Follow a program that fits your schedule and benefit from 24/7 support and various payment options.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]