Mixture-of-Depths, Unveiling the Future of Siri, LLaMa 3, & more
TheTechOasis
Breaking down the most advanced AI systems in the world to prepare you for the future.
5-minute weekly reads.
TLDR:
AI News of the Week
AI Research of the Week:
Google presents Mixture-of-Depths, and it looks amazing
Apple ReALM and Ferret-UI, the Future of Siri?
Leaders: Here's why AI Will be Decentralized
Week's Update
Welcome back! This week is packed with very important news.
Starting off, Meta has announced that they will soon release LLaMa 3, their next family of LLMs.
I suspect the models will be multimodal, at least for images, and considering how consistently Meta pushes the state of the art in audio and speech, with examples like Seamless M4T, I would not rule out seeing the first audio/speech-native Large Language Model.
ChatGPT's GPT-4V LLM is not audio-native; it simply calls the Whisper API for speech-to-text and a text-to-speech (TTS) model for the reverse.
Speaking of our friends at OpenAI, GPT-4 Turbo with Vision is now generally available to developers through the API, so builders can now handle image and text processing from one single API (as it's the same model).
Moving on to hardware, some really spicy news came out of Phoenix. Intel presented its new accelerator chip, the Gaudi 3, which beats NVIDIA's H100 chips by a landslide, especially in training speed and power efficiency.
This isn't that surprising, as Stability AI recently pointed out that they were obtaining better results with Intel's previous chip, the Gaudi 2, than with the H100, so the margin should widen further with the new version. That said, NVIDIA has yet to release its new accelerator, Blackwell.
Fun fact: Intel is worth roughly 13 times less than NVIDIA on the stock market. So, is Intel massively undervalued, or is NVIDIA outrageously overvalued?
On the inference side, we still need to pay attention to Groq, as their hardware, the Language Processing Unit (LPU), appears to be outright superior to any GPU in that regard. You can try Groq's hardware here for free.
However, some have been arguing for a while that the return of analog systems is inevitable. I highly recommend this video if you're interested.
Continuing with hardware, in a more futuristic fashion, it seems Sam Altman is teaming up with Jony Ive, the legendary product designer behind the iPhone, iPad, and others, to build a personal AI device that "won't be similar to an iPhone", and the pair are seeking up to $1 billion in funding.
It's hard to imagine what they are referring to, but the closest thing we have to a personal AI device is Rabbit's R1, which relies entirely on voice interaction to execute the tasks you request.
Fun fact: Rabbit's AI system is neurosymbolic. Unlike standard LLMs, it combines neural networks with high-level, human-coded programming to increase performance and decrease training times and costs.
I want to end this week's update with an interesting piece by The New York Times on how most AI labs are desperately, and illegally, cutting corners to amass enough data for training.
The NYT is in a lawsuit with OpenAI over precisely this matter, so assume a certain bias.
Mixture-of-Depths
Rarely have I been more excited about a research paper. Presented by Google DeepMind, Mixture-of-Depths, or MoD from now on, simply makes sense.
Its principle?
Not all tasks are created equal. In other words, MoD models can dynamically allocate compute to each prediction, just like a human would.
If proven at scale, MoDs would not only drastically reduce the compute required to run models, but also open the door to smarter and more powerful ones.
And the best thing? This can be applied to every single LLM in the world.
Not all Thoughts are Created Equal
Humans, when faced with a task, can decide how much thought, or effort, they will dedicate to the problem.
While some issues can be resolved quickly, even unconsciously, other tasks need your complete attention and focus.
In short, humans allocate compute to a task depending on its predicted complexity.
But do AI models do the same? Hard no. Our current models dedicate the exact same compute to every single prediction, no matter how hard (or easy) it is.
This begs the question: if models allocate the same compute to simple tasks as to hard ones, could it be that AI models, specifically Transformers, consume more compute than they actually need? And can we make them smarter?
This is precisely what MoD tackles.
Letting Models Decide
To understand how MoD models work, we first need to clarify how Transformer LLMs work.
Everyone Gets Attended
Transformer models, with prominent examples like ChatGPT, Gemini, or Claude, are sequence-to-sequence models that receive an input sequence, like text, and output another sequence, usually the continuation of that text.
They are built as a stack of "Transformer blocks", as explained last week. Each block has two distinct layers:
A multi-head attention layer, to compute the attention mechanism
A Feedforward Layer (FFN), to improve feature extraction
And that's basically it.
You stack these blocks one after the other, and in the last layer of the model, you score all the possible words the model could output and choose one of the top-k most reasonable continuations of the sequence, as sketched below. Read this full in-depth explanation of attention here.
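For readers who like code, here is a minimal PyTorch sketch of such a block and of how blocks are stacked. The layer sizes, names, and the pre-norm layout are illustrative choices of mine, not those of any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Layer 1: multi-head attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Layer 2: feedforward network (FFN) for feature extraction
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]       # every token attends to every other token
        x = x + self.ffn(self.norm2(x))     # per-token feature extraction
        return x

# Stack the blocks, then score every word in the vocabulary for the next position.
d_model, vocab_size, n_blocks = 512, 32_000, 12
blocks = nn.Sequential(*[TransformerBlock(d_model) for _ in range(n_blocks)])
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randn(1, 10, d_model)         # 10 already-embedded input tokens
logits = lm_head(blocks(tokens))             # (1, 10, vocab_size)
top_k_next = logits[0, -1].topk(5).indices   # pick among the top-k continuations
```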
But why do we need many blocks? The more blocks you have, aka the greater the model's depth, the more nuance it can capture in the relationships between words.
Overall, the essence of the Transformer is that over this concatenation of blocks, the different words in the sequence are updated with information provided by other words.
If I ask you what the word "bat" means, you will answer that "it depends on the context".
Well, that's precisely what attention does: it updates the value of every word in a sequence so that it accounts for its surrounding context, so that "bat" ends up referring to an animal or a baseball bat.
But here's the thing.
With standard attention, every word in the sequence gets to attend to every single other word, which is very expensive and, maybe… totally unnecessary?
And here is where MoD comes in.
A Routing Problem
In simple terms, what MoD does is, for every word in a sequence and every block in the model, decide whether that token gets updated or not.
"Updated" means that the word undergoes the attention process we described earlier.
As shown below, before each Transformer block, every token (word in the sequence) is passed through a "router", which assigns it a weight.
This weight determines how important that word is to that block, which is akin to saying "this word should attend to other words, or be attended to".
The routing is not stochastic, aka random. The router's parameters are learned as part of training, so the router "learns" to make these decisions.
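To make the idea tangible, below is a heavily simplified PyTorch sketch of such a routed block: a learned router scores every token, only the top-scoring fraction goes through attention and the FFN, and the rest skip the block untouched. The capacity ratio, layer sizes, and the sigmoid scaling of the output are my own illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """One Transformer block with a learned token router in front of it."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, capacity_ratio: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # assigns a scalar weight to each token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.capacity_ratio = capacity_ratio  # fraction of tokens this block actually processes

    def forward(self, x):
        B, T, _ = x.shape
        k = max(1, int(T * self.capacity_ratio))
        scores = self.router(x).squeeze(-1)        # (B, T) learned importance weights
        chosen = scores.topk(k, dim=-1).indices    # indices of the top-k tokens per sequence

        out = x.clone()                            # unchosen tokens skip the block via the residual path
        for b in range(B):                         # batch loop kept explicit for readability
            sel = x[b, chosen[b]].unsqueeze(0)     # (1, k, D): only the selected tokens
            h, _ = self.attn(sel, sel, sel)
            h = h + self.ffn(h)
            w = torch.sigmoid(scores[b, chosen[b]]).unsqueeze(-1)  # keeps the routing decision differentiable
            out[b, chosen[b]] = x[b, chosen[b]] + w * h.squeeze(0)
        return out

block = MoDBlock()
x = torch.randn(2, 16, 512)   # 2 sequences of 16 embedded tokens
y = block(x)                  # only 2 of the 16 tokens per sequence are computed (12.5% capacity)
```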
As shown in the figure below, this means that, depending on the current prediction, the model gets to decide which previous words in the sequence are relevant to predicting the new word.
Source: Google
Let's say we have the sequence "Jane is French, born in the capital, and likes ice cream. Thus, Jane was born in…"
For the next-word prediction, the model has to decide which previous words the last predicted word, "in", should pay attention to.
For instance, should "in" pay attention to "and likes ice cream"? Do you think that part of the sequence is relevant to the model's ability to output "Paris" as the next word?
In each case, for this precise prediction:
A standard Transformer would attend to all previous words
On the other hand, a smarter MoD Transformer would decide which tokens are irrelevant and, thus, route them to the residual connections (depicted as empty arrows at the sides in the picture above), skipping all computation. In layman's terms, the fact that she likes ice cream is irrelevant, so those words are ignored for that precise prediction.
See the point? MoD models are "aware" of the value of each word in a sequence when predicting the next one!
An important note: a word not being chosen doesn't mean it can't be chosen in the future; it just depends on the current prediction.
An Exciting Discovery
Overall, MoD's results are incredible. In a fixed-compute comparison between MoDs and standard Transformers, MoDs are not only much more efficient, they are also smarter.
With an 87.5% capacity reduction, meaning that almost 9 out of 10 tokens are not computed in any given block, the model was still competitive with standard models.
This means that despite saving almost 90% of the compute cost, the model was still on par with the 100% compute-allocation model! And not only that: for the same compute, MoDs actually perform better than the baseline.
One last thing to mention is their relationship with Mixture-of-Experts (MoE) models.
MoE is another form of conditional computation that, instead of choosing which tokens participate in the prediction, breaks the model down into "experts" so that each part of the model becomes more specialized per topic and more efficient to run. GPT-4, Mixtral, Gemini 1.5, and Claude 3 are all MoEs.
But does that mean we have to choose one or the other? Luckily, no.
While MoE works along the width dimension, MoD works along the depth dimension, where the model has a fixed compute budget and has to be clever about which tokens (words, in LLMs' case) to engage for each prediction.
In other words, they can be combined. In fact, MoD's researchers tried it, achieving a highly promising result: for a given compute budget, the model reached a lower loss by combining both methods.
Source: Google
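For intuition on how the two routing axes differ, here is a toy top-1 MoE feedforward layer in PyTorch; the gating scheme and sizes are illustrative only. An MoE router picks which expert processes each token (width), while a MoD router decides whether a block processes a token at all (depth), which is why both kinds of routing can coexist in the same model.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """A toy top-1 Mixture-of-Experts feedforward layer: width-wise routing."""
    def __init__(self, d_model: int = 512, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # picks WHICH expert handles each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (batch, tokens, d_model)
        choice = self.gate(x).argmax(dim=-1)        # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i                      # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

layer = MoEFFN()
y = layer(torch.randn(2, 16, 512))   # every token is processed, but by only one of the 4 experts
```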
What We Think
In my humble opinion, this is one of those research papers that could set a precedent.
From its results, it can be interpreted that Mixture-of-Depths, besides creating more efficient models, makes models smarter overall by making them more "aware" of allocating compute to what matters, akin to humans dedicating more thought to a hard problem to get better results.
While it has to be proven at scale, we could soon see all models being MoDs.
This week on Leaders…
I'll say it straight to your face: I can, and will, convince you that Crypto is not only necessary, but essential, to the future of AI.
I won't try to sell you any stupid cryptocurrency, but I will prove to you that blockchain technology is a fundamental piece of the puzzle.
If you think I'm full of crap, I dare you to click the following button. Otherwise, you will remain completely oblivious to one of the most important opportunities of the decade.
Apple's Huge Hint on Siri's Future
Once again, Apple has released two AI research papers with important insights into their strategy regarding Generative AI.
Undoubtedly, among Apple's products most desperately in need of an upgrade is Siri, its (terrible) voice assistant.
Surprisingly, more than a year after ChatGPT's release, Siri remains just as dull and limited, feeling almost prehistoric.
But with Apple's ReALM and Ferret-UI models, we now have a much clearer idea of the future of Siri.
A Long-Standing Issue to be Solved
At this point, many people will be wondering: "Why hasn't Apple updated its voice assistant, Siri?"
Naturally, they have plenty of reasons to do so, as companies like Google have already replaced their assistants with a Large Language Model (LLM), Gemini.
But unlike Google, Apple is rarely prone to making big releases without being entirely sure that their product is excellent.
And the problem is that, truthfully, most LLMs today, even our best ones, are terrible voice assistants for two reasons: data and size.
The Missing Data
In AI, data >>> architecture.
No matter how good your architecture design is, without the appropriate data for a task, the model will fail miserably. And reference-resolution data might be up there among the hardest to come by.
But what is reference resolution?
Reference resolution is the process of determining which specific entities or items ambiguous references in language point to, taking various types of context into account.
For example, in the image below, the user asks the voice assistant to display nearby pharmacies.
Once the assistant shows the list, the user might refer to elements in the list indirectly (as in the three examples shown below), requiring the agent to identify the reference.
Source: Apple
If you think about it, most of your interactions with Siri or any other voice assistant are precisely built this way, where you expect the assistant to be able to detect these relationships.
But you can make things even harder for Siri, with examples such as "Can you lower the volume?", which might refer to your iPhone's sound or to the room speaker that also happens to be connected to the agent.
And Siri also has to deal with another issue, on-screen references, which means it must handle not only textual references but visual ones too.
Overall, even though our current best models should be able to do all this, they lack the data to optimize for this task and perform horribly at it.
But thereās an added complexity.
Small Hardware Requires Small Models
If you're going to run Siri on an LLM, you need it to be very, very small.
As the Transformer's weights file (where its knowledge is stored) has to sit in RAM, even the most premium smartphones can't run a medium-sized model; currently, the largest amount of RAM a smartphone offers is around 24GB.
This means you will be able to run a 10GB LLM at most, and that's very optimistic.
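As a back-of-the-envelope check (my own rough numbers, ignoring activations and OS overhead), a model's memory footprint is roughly its parameter count times the bytes stored per weight:

```python
# Rough rule of thumb: RAM needed ≈ parameters × bytes per weight.
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params, bits in [(7, 16), (7, 4), (3, 4)]:
    print(f"{params}B parameters @ {bits}-bit weights ≈ {model_size_gb(params, bits):.1f} GB")
# 7B @ 16-bit ≈ 14.0 GB  -> well above the ~10GB practical ceiling mentioned above
# 7B @ 4-bit  ≈  3.5 GB  -> feasible on high-end devices
# 3B @ 4-bit  ≈  1.5 GB  -> comfortable
```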
So what did Apple do?
Purposely created Data and a Specialist Model
Apple trained ReALM by:
Creating diverse datasets that included examples from conversations, synthetic scenarios, and actual screen content.
Adapting a large language model (FLAN-T5) by teaching it to understand these examples. They converted all the different entities and scenarios into a text format the model could learn from.
Innovating a way to describe what's on a screen with text only, so the model could "visualize" screen layouts and entities as if it were seeing them, helping it decide what references like "the top one" or "the left one" mean (a toy sketch of this idea follows below).
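To make that last point concrete, here is a hypothetical toy version of the idea of rendering a screen as plain text for an LLM; the entity format, helper names, and prompt wording are entirely my own illustration, not Apple's actual encoding.

```python
from dataclasses import dataclass

@dataclass
class ScreenEntity:
    text: str
    left: int   # bounding-box position in screen pixels
    top: int

def screen_to_text(entities: list[ScreenEntity]) -> str:
    """Flatten on-screen entities into plain text, top-to-bottom and left-to-right."""
    ordered = sorted(entities, key=lambda e: (e.top, e.left))
    return "\n".join(f"[{i}] {e.text} (x={e.left}, y={e.top})" for i, e in enumerate(ordered))

screen = [
    ScreenEntity("CVS Pharmacy - 0.4 mi", left=20, top=120),
    ScreenEntity("Walgreens - 1.1 mi", left=20, top=200),
    ScreenEntity("Rite Aid - 2.3 mi", left=20, top=280),
]
prompt = (
    "Screen:\n" + screen_to_text(screen) +
    "\nUser: call the top one\nWhich entity does the user refer to?"
)
print(prompt)   # a reference-resolution model should map "the top one" to entity [0]
```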
Funnily enough, ReALM's performance was impressively competitive with GPT-4, even though it is a much lighter model with orders of magnitude fewer parameters.
However, just three days ago Apple announced an MLLM focused on screen understanding, named Ferret-UI. As shown below, this model can comprehend iPhone (and Android) screens at any required granularity.
It completely obliterates any other MLLM in on-screen tasks, including GPT-4V.
Are these releases pure coincidence? I think not.
Surely, Siri's upgrade will be a combination of both. Probably using Ferret-UI as the base, they will fine-tune it with the reference data used in ReALM to create a model that understands vague or subtle references perfectly while being able to interpret every single object and command shown on the screen.
That being said, I expect newer versions of this architecture before the upgrade for two reasons:
They used GPT-4 to generate data, meaning they can't commercially use Ferret-UI
Ferret-UI is still at least 7 billion parameters in size, so unless they are considering storing LLMs in flash memory, they will need it to get smaller.
What We Think
With every new release, Apple's strategy toward GenAI is very clear:
Everything they are building is focused on the iPhone's next big update, iOS 18.
Unlike other big tech players, which are trying to create the next big AI leap, Apple seems committed to picking the low-hanging fruits left by their competitors and trying to innovate in the opposite direction:
Instead of making AI bigger and better, they are making it smaller but more efficient, which is crucial for the consumer-end use cases that are the essence of Apple's business model.
Sponsor of the Week: MIT
AI: Balancing Risk and Return
Innovative technologies are revolutionizing business as we know it, and they're more accessible than ever. But to truly harness the transformative potential of AI, you need to know how and when to use it. And which pitfalls to avoid.
The six-week Artificial Intelligence: Implications for Business Strategy online short course from the MIT Sloan School of Management and the MIT Computer Science and Artificial Intelligence Laboratory explores AI's business applications and challenges.
Choose this program to:
Optimize your business: Leverage AI, ML, and robotics to drive efficiencies, improve productivity, and support your growth.
Develop a strategic roadmap: Apply your knowledge to effectively integrate AI into your business.
Gain a dual perspective: Benefit from a course designed by two prestigious schools, the MIT Sloan School of Management and MIT CSAIL.
Conveniently build career-critical skills: Follow a program that fits your schedule and benefit from 24/7 support and various payment options.
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]