
Google's Infinite Attention, What Grok Tells about Tesla, & More


🏝 TheTechOasis 🏝


Breaking down the most advanced AI systems in the world to prepare you for the future.

10-minute weekly reads.

TLDR:

  • AI News of the Week: Apple, Stanford, OpenAI…

  • AI Research of the Week:

    • Toward Infinite Attention, the Breakthrough Google was Waiting for?

    • Why Grok 1.5 Vision Might Save Tesla

🚨 Week’s Update 🚨

Welcome back!

As usual in AI, the week was packed with announcements. For starters, we got the first high-profile firing of researchers for leaking information, and the company doing the firing was none other than OpenAI.

Apple announced that all new AI enhancements coming to the iPhone with iOS 18 will be privacy-preserving by being “on-device”.

In layman’s terms, the iPhone won’t require an Internet connection to run its AI features. This is important because, as we saw last week, Apple will introduce Multimodal Large Language Models (MLLMs) into its stack to revamp Siri, among other products.

Contrary to what many believe, MLLMs do not require an Internet connection. As long as you can store the weights file and the executable on your device, you are good to go.

Apple’s research on Flash LLMs starts to make much more sense now…

For the seventh year straight, Stanford has released its AI Index Report, with notable insights on how AI is accelerating scientific progress (more on that in Sunday’s Leaders newsletter) and how regulators worldwide are increasing their pace.

Moving on, Google DeepMind has unveiled the ALOHA project to build low-cost, highly dexterous robots. The videos uploaded by the researchers show robots tying shoelaces or hanging a shirt.

For the next news, the video speaks for itself. This robot coming from your darkest nightmares is Boston Dynamics’ latest humanoid.

Fun fact: Boston Dynamics has historically avoided using AI. I wonder whether that will continue to be the case, as Tesla and Figure.ai rely entirely on LLMs to make their robots work.

On the hardware side of things, Logitech has unveiled an ‘AI mouse’ that, in reality, is simply a standard mouse with a seamless ChatGPT integration, letting you access the LLM faster and more efficiently.

It’s nice to see AI being embedded into products. Can’t wait to see Logitech brand itself as “an AI company” though, get ready for that nonsense too!

On a final note, let’s put our tinfoil hats on, as reports suggest that OpenAI’s and Meta’s new models (potentially GPT-5 and LLaMa 3) will show a stark improvement in reasoning and planning.

Surely not AGI, but something better than what we have seen until now.

📈 Toward Infinitely-Long Sequences 📈

Long sequences are an absolute unlocker for multiple use cases like DNA processing, long summarization, or video processing, which require orders of magnitude more data than text.

This week, Google presented Infini-attention, a new attention mechanism variant that promises the capacity to scale input sequences to infinite lengths.

What’s more, as we’ll see in a minute, it’s highly likely this is the breakthrough that allowed Google to release Gemini 1.5 just weeks after announcing Gemini 1.0, with a 1-million-token context window, a ten-fold increase.

But to grasp why this is the case, we first must understand the issues that were preventing models from scaling to long sequences.

The Gift that Keeps on Giving

When it comes to the world of Large Language Models (LLMs), be that ChatGPT or Gemini, everything today is a Transformer, a specific type of architecture built around the attention mechanism.

The core principle is that, to comprehend what it is being sent, an LLM uses attention to update the meaning of each word with respect to the other words in the sequence.

Just like in the sentence ‘I love my baseball bat’, humans realize ‘bat’ refers to a piece of sports equipment and not an animal by looking at the other words in the sequence, and LLMs do essentially the same.

In practice, paying attention means computing the similarity between words. If the attention score between two words is high (for instance, a noun and the adjective referring to it), then each word’s ‘meaning’ gets updated with the information provided by the other.

Before attention, ‘bat’ had multiple possible meanings. After attention, it’s undeniably a baseball bat. It’s that simple.

This concept yields amazing results, but it comes at a steep cost.

The Great Burden

As every word has to pay attention to every single other word in the sequence to compute its attention score, we have N² computations for a sequence of length N.

In technical terms, this means that the attention mechanism implies a quadratic complexity with respect to the input sequence’s length. If we double the sequence length, the computation cost quadruples. If we triple, the cost increases nine-fold.

Memory also balloons with sequence length: materializing the attention matrix grows quadratically, and the KV Cache grows with every new token, which in practice is the bigger problem for long sequences.
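To make the scaling concrete, here is a tiny, purely illustrative Python snippet (the real cost also depends on model width, number of heads, and layers, so treat it as an order-of-magnitude sketch):

```python
# Back-of-envelope: attention compares every token with every other token,
# so a sequence of length N produces an N x N matrix of scores (per head, per layer).
for n in (1_000, 10_000, 100_000):
    print(f"N = {n:,}: {n * n:,} pairwise attention scores")

# N = 1,000: 1,000,000 pairwise attention scores
# N = 10,000: 100,000,000 pairwise attention scores
# N = 100,000: 10,000,000,000 pairwise attention scores
```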

This forces AI engineers to abruptly cap the maximum length a Transformer can process, a limit we call the context window. Besides being a nuisance when we want to process long sequences, this has another immediate impact: the model is forced to forget what it saw earlier.

For instance, if we have a 100k context length (75k words) and we arrive at the limit, for future predictions the model stops paying attention to the first tokens in the sequence, which means it forgets those parts.

But with Infini-attention, we might be on the verge of a new trend: infinite sequences despite bounded compute and memory.

Stop Forgetting

In standard attention, LLMs can only attend to the data in its current context window, aka ‘it can only attend to the last x words in the sequence’.

The rest are simply forgotten.

To mitigate this, the Google researchers introduce the concept of compressive memory. In layman’s terms, the model not only has access to the most recent text, but also to a compressed summary of the past.

Source: Google

If we recall the principles of attention, this idea of ‘computing similarity’ between words is done through a three-vector projection. In simple terms, each word generates three vectors:

  • A query, which states what the word is looking for

  • A key, which advertises what information the word can offer (and is matched against queries)

  • A value, which states what information the word will provide to tokens that decide to attend to it
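To ground those three vectors, here is a minimal NumPy sketch of standard (single-head) dot-product attention. It is an illustration only: real LLMs add multi-head projections, causal masking, and many optimizations on top of this.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Standard attention: every token's query is scored against every token's key.

    Q, K, V: (N, d) arrays, one query/key/value vector per token.
    Returns an (N, d) array of updated token representations.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) similarity matrix -> quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # each token mixes in the values it attends to

# Toy usage: 5 tokens with 8-dimensional vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(dot_product_attention(Q, K, V).shape)          # (5, 8)
```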

So what do LLMs do to access their past memory? They use the query vectors to search and retrieve the most valuable data from the past.

At this point you may be wondering: how does this solve anything? Here’s the crucial point: Infini-attention incorporates a different attention mechanism, linear attention, which is much cheaper.

Toward Subquadratic Retrieval

Without this last point, you would be dead on: we would essentially be doing the same thing. However, as mentioned, the way the model accesses data from previous segments is different.

In standard attention, as sequences grow, everything grows too. But if you look back, you’ll notice I mentioned the word ‘compressed’.

In layman’s terms, the memory size over segments in the past is compressed and fixed (depicted in green in the previous image).

Consequently, although the model has full access to all the data in the most recent segment (meaning that the attention mechanism in this segment is the same as always), the data from past segments is aggregated.

For full understanding, let’s see a comparison:

Let’s say we have a book with 12 chapters, each with 1,000 words.

- We want our standard-attention LLM, which has a context window of 1,000 words, to read it. If the model has just read Chapter 12, it has access to absolutely every single detail in this chapter. However, it has completely forgotten everything that took place in the previous 11.

- On the other hand, an Infini-attention LLM not only has full access to all info in Chapter 12, but also a summary of the past 11 chapters.

Of course, this means we are incurring information loss from the past, but that’s a game-changing improvement from standard LLMs that do not have any memory at all!

But how is this technically implemented?

The Return of Linear Attention

In simple terms, the model carries a state, a summary of the past, in a similar fashion to LSTMs and RNNs. This memory can be retrieved, and can also be updated.

Consequently, every time a new segment is read, the memory gets updated by aggregating the knowledge from the past with the new knowledge, in a recurrent fashion.

Importantly, as this ‘memory’ is compressed, accessing it is much cheaper.

This is beautifully explained by the original proponents of linear attention, but the key intuition is that, instead of computing N2 attention scores between all words, they aggregate the key and value vectors into a compressed global context vector.

Using the book example, this is analogous to thinking that, instead of having to review the entire first eleven Chapters word by word, we can recall the summary, which is far cheaper and faster.

But I don’t want to bore you with unnecessary detail. The way Infini-attention works boils down to two components:

  1. For words in the last segment (akin to the context window) the model performs standard dot-product attention, with all the advantages and quadratic disadvantages it entails.

  2. But to access its past memory, the model uses linear attention, where instead of performing N² computations, the model only has access to a heavily compressed summary of the past (sketched below).
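For intuition, here is a rough NumPy sketch of the compressive-memory idea. It is not the paper’s exact formulation (which uses per-head memories, a learned mixing gate, and an optional delta-rule update); the point is simply that the memory has a fixed size and is read with queries and written with keys and values:

```python
import numpy as np

def elu_plus_one(x):
    # Non-negative feature map commonly used in linear attention (ELU + 1)
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Toy fixed-size memory: its footprint never grows with the number of segments."""
    def __init__(self, d_key, d_value):
        self.M = np.zeros((d_key, d_value))    # compressed "summary of the past"
        self.z = np.zeros(d_key)               # running normalization term

    def retrieve(self, Q):
        # Linear-attention read: cost depends on the segment length, not on total history
        sigma_q = elu_plus_one(Q)                                   # (N, d_key)
        return (sigma_q @ self.M) / (sigma_q @ self.z + 1e-6)[:, None]

    def update(self, K, V):
        # Fold the current segment's keys and values into the fixed-size memory
        sigma_k = elu_plus_one(K)                                   # (N, d_key)
        self.M += sigma_k.T @ V                                     # stays (d_key, d_value)
        self.z += sigma_k.sum(axis=0)                               # stays (d_key,)

# Per segment, the model would combine local dot-product attention over the segment
# with this cheap read of everything that came before (the paper learns the mixing gate):
#   out = gate * memory.retrieve(Q) + (1 - gate) * local_attention(Q, K, V)
rng = np.random.default_rng(0)
memory = CompressiveMemory(d_key=8, d_value=8)
for _ in range(2):                              # two consecutive 4-token segments
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    past = memory.retrieve(Q)                   # what the model "remembers" so far
    memory.update(K, V)                         # then fold this segment into the memory
print(past.shape, memory.M.shape)               # (4, 8) (8, 8): memory size is fixed
```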

One last point worth mentioning. Infini-attention, just like its traditional counterpart, has recency bias as its main inductive bias.

In other words, as the model has full access to recent words but a summary of the past, this architecture is assuming that recent context is more relevant than past context in order to model language.

Using the book analogy, to predict what will happen in Chapter 13, the model is assuming that Chapter 12 is more relevant than Chapters 1 to 11.

Results-wise, Infini-attention looks highly promising, yielding almost 100% retrieval in the passkey problem (finding a random, very specific number combination hidden in a huge, unrelated text). These results are a strong indication that the method is already in use inside Google, and possibly even available to the public through Gemini 1.5.

What We Think

The best thing about Infini-attention is the fact that it doesn’t require retraining the entire model, as it introduces minimal changes to standard attention (it just adds the compressed memory).

This would explain how Google managed to turn Gemini 1.0 into Gemini 1.5, with a ten-fold larger context window, just weeks later, when these models can take months to train.

The natural next step is to analyze whether this compressed memory is enough for models to scale in length indefinitely, as the fixed-size memory will probably start forgetting important details due to its limited capacity.

That being said, it’s really fascinating to see the convergence of Transformers with recurrent models and what seems to be the natural evolution of state-of-the-art AI to scale to billions of tokens in memory.

📚 Sponsor of the Week 📚

Learn AI in 5 Minutes a Day

AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.

Our research team spends hundreds of hours a week summarizing the latest news, and finding you the best opportunities to save time and earn more using AI.

🏎️ Is Grok 1.5 Vision the Solution to Tesla’s FSD? 🏎️

Grok 1.5 Vision, the new super Multimodal Large Language Model, could have an impact well beyond expected.

In fact, it could be the sign Tesla has been waiting for to take its Full Self-Driving (FSD) mode to the next level.

But how?

Overhyped but Underdelivered

Look, I own a Tesla Model 3, so I am as biased as they get.

And still, to this day, I haven’t touched the FSD mode, because it’s as expensive as it is underwhelming.

Nevertheless, while Tesla’s FSD mode is still an SAE Level 2, other car manufacturers have released Level 3 vehicles already, like Mercedes’ EQS sedan and S-classes, for instance.

SAE levels are the six levels of driving automation defined by the Society of Automotive Engineers, ranging from no automation (Level 0) to full automation (Level 5).

Level 2 essentially means that the car is never trusted to monitor the road on its own; the driver must supervise at all times.

Unsurprisingly, FSD, despite the very recent price cut, is as unpopular as ever, in line with overall sentiment toward the industry: only 9% of Americans actually trust these systems, according to AAA.

Adding insult to injury, Tesla’s bet on camera-based methods instead of LiDAR is quite controversial, as many consider the latter superior for complex situations.

Tesla’s FSD is also object-based: it detects objects, predicts their trajectories, and makes decisions based on those predictions.

Other labs are proposing occupancy-based systems where, instead of tracking object trajectories, the car chooses its trajectory based on the probability of each position being occupied by other vehicles, effectively predicting collision probability. There’s no consensus on which option is best.

Overall, Tesla’s FSD seems to be in the perfect storm: overhyped, overall low trust, and considered completely unsuited for edge cases or highly complex situations.

So, how could Grok 1.5 Vision change that?

MLLMs as High-Level Planners and Reasoners for Cars

In case you aren’t aware, Elon Musk owns not only Tesla but also an AI company known as xAI.

A few days ago, they announced Grok 1.5 Vision, an MLLM that matches the performance of the “great three”, GPT-4 (OpenAI), Claude 3 (Anthropic), and Gemini (Google), in the most popular benchmarks:

Word of caution: Grok 1.5V’s results have yet to be proven “in the wild”, so take this with a pinch of salt.

Still, the MLLM looks pretty good if you look at the examples in the link at the beginning of this article.

But what is an MLLM? In simple terms, MLLMs are Large Language Models paired with the capacity to also process other data modalities, like images, audio, or videos.

But what does this have to do with Tesla’s FSD? Well, everything.

One of the main constraints preventing FSD systems from becoming more autonomous is their capacity to handle ‘edge cases’: unexpected, complex situations the model has rarely (or never) seen in its training data, and the perfect setup for the car to fail.

In practice, camera-based FSD software like Tesla’s constantly scrutinizes the data coming from the car’s cameras and sensors to estimate the current state of the driving environment, with the car’s policy deciding which action to take next.

For example, if the car approaches a STOP sign, the model’s action should be to bring itself to a halt. But the key differentiator between standard driving and edge cases is the thinking mode they require.

Picture yourself driving. In most cases, you aren’t really thinking about the actions you take, like switching gears, changing lanes, or pushing the brake pedal.

These are instinctive actions based on the thousands of hours you’ve already driven. This is “System 1 thinking” based on the late Daniel Kahneman’s theory of thinking modes.

But in very weird situations you’ve never seen before, or in situations where you are lost, you might need to plan your actions.

It still has to be fast, but you are nonetheless thinking about your actions instead of instinctively reacting. This is what Kahneman described as “System 2”.

Therefore, although our current FSD systems seem to work fine in events requiring System 1 decisions, they suffer in System 2 ones. But what if we could use MLLMs to provide this reasoning logic to cars and enhance their decisions?
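To picture what that hybrid could look like, here is a deliberately simplified, hypothetical sketch. None of these functions correspond to real Tesla, Wayve, or xAI APIs; they are placeholders for the idea of keeping routine driving on a fast reflexive policy and escalating only unusual scenes to a slower, MLLM-style planner.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    frame_id: int
    anomaly_score: float   # how unusual the perception stack judges the current situation

def fast_policy(scene: Scene) -> str:
    """'System 1': the reflexive driving policy that handles routine situations."""
    return "keep_lane"

def mllm_planner(scene: Scene) -> str:
    """'System 2': a hypothetical call to a multimodal model that reasons about rare scenes.

    Slower and more expensive, so it is only invoked when the fast policy is out of its depth.
    """
    return "slow_down_and_yield"

def decide(scene: Scene, anomaly_threshold: float = 0.8) -> str:
    # Routine frames stay on the cheap reflexive path; edge cases escalate to the planner.
    if scene.anomaly_score < anomaly_threshold:
        return fast_policy(scene)
    return mllm_planner(scene)

print(decide(Scene(frame_id=1, anomaly_score=0.10)))   # keep_lane
print(decide(Scene(frame_id=2, anomaly_score=0.95)))   # slow_down_and_yield
```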

In fact, we already have a precedent.

The Lingo-1 Model

The UK-based company Wayve is already exploring this idea, with examples like the ones below, where the self-driving car explains the decisions it makes:

Source: Wayve

In fact, Lingo’s newest version was just announced yesterday.

Could this synergy pave the way for autonomous cars to perform well even in edge cases that require planning and complex decision-making?

What’s more, as Elon Musk himself has admitted, there’s a tremendous data upside for Tesla, which already holds an unfathomably large library of real-life video of driving events.

This could propel Tesla’s FSD to a completely new level while also catapulting xAI’s effort with Grok to an unprecedented scale.

Let’s not forget that Elon’s goal with xAI is to create AGI, and real-world video seems to be gaining traction as the most critical data source toward that goal, as acknowledged by OpenAI too.

What We Think

Talking cars feel too cyber-punky, but they seem feasible by today’s standards. However, several questions remain before powerful MLLM-enhanced FSD software becomes a thing:

  • Let’s not overhype things: MLLMs aren’t great reasoners today. In fact, “System 2” reasoning and planning remain largely unsolved and are being obsessively studied by every major AI research lab, from OpenAI to Google DeepMind.

  • Camera & object-based FSD software isn’t universally considered the best option. As we saw earlier, many others in the industry are betting strongly on LiDAR.

  • Latency. In FSD systems, latency is everything, as cars need to make millisecond decisions based on huge quantities of data. And while LPUs like that of Groq show great promise, latency isn’t particularly Large Language Models’ strongest suit today.

That being said, cars that combine their FSD software with MLLMs could see improved reasoning in complex cases, while also adding an extra layer of explainability, as the car itself can “reason” its actions, which could be a fundamental lever in improving overall trust in the systems while also helping engineers debug them.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]