Google's Infinite Attention, What Grok Tells about Tesla, & More
TheTechOasis
Breaking down the most advanced AI systems in the world to prepare you for the future.
10-minute weekly reads.
TLDR:
AI News of the Week: Apple, Stanford, OpenAI…
AI Research of the Week:
Toward Infinite Attention, the Breakthrough Google was Waiting for?
Why Grok 1.5 Vision Might Save Tesla
Week's Update
Welcome back!
As usual in AI, the week brought plenty of new announcements. For starters, we saw the first high-profile firing of researchers for leaking information, and the company doing the firing was none other than OpenAI.
Apple announced that all new AI enhancements coming to the iPhone with iOS 18 will be privacy-preserving by being "on-device".
In layman's terms, the iPhone won't require an Internet connection to run your AI systems. This is important because, as we saw last week, Apple will introduce Multimodal Large Language Models into their stack to revamp Siri, among other products.
Contrary to what many believe, MLLMs do not require an Internet connection. As long as you can store the weights file and the executable on your device, you are good to go.
Apple's research on Flash LLMs starts to make much more sense now…
For the seventh year straight, Stanford has released its AI Index Report, with notable insights on how AI is accelerating scientific progress (more on that in Sunday's Leaders newsletter) and how regulators worldwide are increasing their pace.
Moving on, Google Deepmind has unveiled the ALOHA project to build low-cost, highly dexterous robots. The videos uploaded by researchers show robots tying shoelaces or hanging a shirt.
As for the next piece of news, the video speaks for itself. This robot coming from your darkest nightmares is Boston Dynamics' latest humanoid.
Fun fact: Boston Dynamics has historically avoided using AI. I wonder whether that will remain the case now that Tesla and Figure.ai rely so heavily on LLMs.
On the hardware side of things, Logitech has unveiled an "AI mouse" that is, in reality, simply a standard mouse with ChatGPT seamlessly integrated so you can access the LLM faster and more efficiently.
It's nice to see AI being embedded into products. Can't wait to see Logitech brand itself as "an AI company", though; get ready for that nonsense too!
On a final note, let's get our tinfoil hats on, as reports suggest that OpenAI and Meta's new models (potentially GPT-5 and LLaMa 3) will show a stark improvement in reasoning and planning.
Surely not AGI, but something better than what we have seen until now.
Toward Infinitely-Long Sequences
Long sequences are an absolute unlocker for multiple use cases like DNA processing, long summarization, or video processing, which require orders of magnitude more data than text.
This week, Google presented Infini-attention, a new attention mechanism variant that promises the capacity to scale input sequences to infinite lengths.
What's more, as we'll see in a minute, it's highly likely this is the breakthrough that allowed Google to release Gemini 1.5 just weeks after announcing Gemini 1.0, yet with a 1-million-token context window, a ten-fold increase.
But to grasp why this is the case, we first must understand the issues that were preventing models from scaling to long sequences.
The Gift that Keeps on Giving
When it comes to the world of Large Language Models (LLMs), be that ChatGPT or Gemini, everything today is a Transformer, a specific type of architecture built around the attention mechanism.
The core principle is that, to comprehend the input, an LLM uses attention to update the meaning of each word with respect to the other words in the sequence.
Just like in the sentence "I love my baseball bat" humans realize "bat" refers to a piece of sports equipment and not an animal by looking at the other words in the sequence, LLMs do essentially the same.
In practice, paying attention means computing the similarity between words. If the attention score between two words is high (for instance, a noun and the adjective referring to it), then each word's "meaning" gets updated with the information provided by the other.
Before attention, "bat" had multiple possible meanings. After attention, it's undeniably a baseball bat. It's that simple.
This concept yields amazing results, but it comes at a steep cost.
The Great Burden
As every word has to pay attention to every single other word in the sequence to compute its attention score, we have N² computations for a sequence of length N.
In technical terms, this means that the attention mechanism has quadratic complexity with respect to the input sequence's length. If we double the sequence length, the computation cost quadruples. If we triple it, the cost increases nine-fold.
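To make that scaling concrete, here is a tiny illustrative Python snippet (the function name is just for illustration) that counts the pairwise attention scores a model would compute for different sequence lengths:

```python
def count_attention_scores(n: int) -> int:
    # Every token attends to every token (itself included),
    # so the score matrix has n * n entries.
    return n * n

for n in (1_000, 2_000, 3_000):
    print(f"{n:>5,} tokens -> {count_attention_scores(n):>10,} attention scores")

# 1,000 tokens ->  1,000,000 attention scores
# 2,000 tokens ->  4,000,000 attention scores (2x the length, 4x the cost)
# 3,000 tokens ->  9,000,000 attention scores (3x the length, 9x the cost)
```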
Memory costs also explode with the input sequence length, notably because of the KV Cache, which grows with every new token and is, in practice, an even bigger problem for long sequences.
This forces AI engineers to strictly cap the maximum length a Transformer can process, what we call its context window. Besides being a nuisance when we want to process long sequences, this has another immediate impact: the model is forced to forget what it saw earlier.
For instance, if we have a 100k context length (75k words) and we arrive at the limit, for future predictions the model stops paying attention to the first tokens in the sequence, which means it forgets those parts.
But with Infini-attention, we might be on the verge of a new trend: infinite sequences despite bounded compute and memory.
Stop Forgetting
In standard attention, LLMs can only attend to the data in their current context window, aka "they can only attend to the last x words in the sequence".
The rest are simply forgotten.
To mitigate this, the Google researchers introduce the concept of compressive memory. In layman's terms, the model not only has access to the most recent text, but also to a compressed summary of the past.
Source: Google
If we recall the principles of attention, this idea of "computing similarity" between words is done through a three-vector projection. In simple terms, each word generates three vectors:
A query, which states what the word is looking for
A key, which advertises what information the word offers
A value, which carries the actual information the word passes on to tokens that decide to attend to it
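To ground those three vectors, here is a minimal NumPy sketch of standard scaled dot-product attention; the shapes, random projections, and function name are illustrative rather than any particular model's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) matrices of query, key, and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise similarities: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: how much each word attends to every other word
    return weights @ V                               # each word's representation, updated by what it attended to

# Toy usage: 6 "words", 8-dimensional embeddings, random projection matrices.
seq_len, d = 6, 8
x = np.random.randn(seq_len, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (6, 8): one updated vector per word
```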
So what do LLMs do to access their past memory? They use the query vectors to search and retrieve the most valuable data from the past.
But at this point you may be wondering: how does this solve anything? Here's the crucial point: Infini-attention incorporates a different attention mechanism, linear attention, which is much cheaper.
Toward Subquadratic Retrieval
Without this last point, you would be dead on: we would essentially be doing the same thing. However, as mentioned, the way the model accesses data from previous segments is different.
In standard attention, as sequences grow, everything grows too. But if you look back, you'll notice I mentioned the word "compressed".
In layman's terms, the memory covering past segments is compressed into a fixed size (depicted in green in the previous image).
Consequently, although the model has fully-fledged access to all the data in the most recent segment (meaning that the attention mechanism in this segment is the same as always), the data from past segments is aggregated.
For full understanding, let's see a comparison:
Letâs say we have a book with 12 chapters, each with 1,000 words.
- We want our standard-attention LLM, which has a context window of 1,000 words, to read it. If the model has just read Chapter 12, it has access to absolutely every single detail in this chapter. However, it has completely forgotten everything that took place in the previous 11.
- On the other hand, an Infini-attention LLM not only has full access to all info in Chapter 12, but also a summary of the past 11 chapters.
Of course, this means we are incurring information loss from the past, but that's a game-changing improvement over standard LLMs, which do not have any memory at all!
But how is this technically implemented?
The Return of Linear Attention
In simple terms, the model carries a state, a summary of the past, in a similar fashion to LSTMs and RNNs. This memory can be retrieved, and can also be updated.
Consequently, every time a new segment is read, the memory gets updated by aggregating the knowledge from the past with the new knowledge, in a recurrent fashion.
Importantly, as this "memory" is compressed, accessing it is much cheaper.
This is beautifully explained by the original proponents of linear attention, but the key intuition is that, instead of computing N² attention scores between all words, they aggregate the key and value vectors into a compressed global context vector.
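Here is a minimal sketch of that intuition, loosely following the linear-attention formulation (using the ELU+1 feature map from the original linear-attention paper; Infini-attention's exact memory layout may differ):

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1 keeps features positive, as in the linear-attention paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def compress(K, V):
    """Aggregate an entire past segment into a fixed-size memory."""
    phi_K = feature_map(K)
    M = phi_K.T @ V              # (d, d) memory matrix, independent of sequence length
    z = phi_K.sum(axis=0)        # (d,) normalization term
    return M, z

def retrieve(q, M, z):
    """Query the compressed memory with a single query vector."""
    phi_q = feature_map(q)
    return (phi_q @ M) / (phi_q @ z + 1e-6)

d = 8
past_K, past_V = np.random.randn(10_000, d), np.random.randn(10_000, d)
M, z = compress(past_K, past_V)                    # 10,000 words squeezed into a d x d matrix
print(retrieve(np.random.randn(d), M, z).shape)    # (8,) regardless of how long the past was
```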
Using the book example, this is analogous to thinking that, instead of having to review the entire first eleven Chapters word by word, we can recall the summary, which is far cheaper and faster.
But I don't want to bore you with unnecessary detail. The way Infini-attention works boils down to two components:
For words in the last segment (akin to the context window), the model performs standard dot-product attention, with all the advantages and quadratic disadvantages that entails.
But to access its past memory, the model uses linear attention: instead of performing N² computations, it only queries a heavily compressed summary of the past.
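Putting both components together, a hypothetical per-segment loop could look like the sketch below, which reuses `scaled_dot_product_attention`, `compress`, and `retrieve` from the earlier snippets. The real paper uses a learned gate and a more careful memory update, so treat the blending and the fixed `beta` here as stand-ins:

```python
import numpy as np

def infini_attention_segment(Q, K, V, M, z, beta=0.5):
    """Process one segment: local attention plus compressed-memory retrieval (illustrative only)."""
    # 1) Standard quadratic attention *within* the current segment only.
    local = scaled_dot_product_attention(Q, K, V)
    # 2) Cheap linear-attention lookup into the memory of all past segments.
    memory = np.stack([retrieve(q, M, z) for q in Q])
    # 3) Blend the two (the paper learns this gate; beta is a fixed stand-in here).
    out = beta * local + (1 - beta) * memory
    # 4) Fold the current segment into the memory for future segments.
    dM, dz = compress(K, V)
    return out, M + dM, z + dz

d, seg_len = 8, 512
M, z = np.zeros((d, d)), np.zeros(d)      # empty memory before the first segment
for _ in range(4):                         # e.g. four "chapters" of a book
    Q, K, V = (np.random.randn(seg_len, d) for _ in range(3))
    out, M, z = infini_attention_segment(Q, K, V, M, z)
```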
One last point worth mentioning. Infini-attention, just like its traditional counterpart, has recency bias as its main inductive bias.
In other words, as the model has full access to recent words but a summary of the past, this architecture is assuming that recent context is more relevant than past context in order to model language.
Using the book analogy, to predict what will happen in Chapter 13, the model is assuming that Chapter 12 is more relevant than Chapters 1 to 11.
Results-wise, Infini-attention looks highly promising, yielding almost 100% retrieval on the passkey problem (finding a random, very specific number combination buried in a huge unrelated text). Those results are a strong indication that this method is already used inside Google, possibly even in production with Gemini 1.5.
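For context, a passkey evaluation is usually constructed along these lines (a rough sketch; the exact prompt wording and key placement vary across papers):

```python
import random

def build_passkey_prompt(n_filler: int) -> tuple[str, str]:
    """Bury a random passkey inside a huge amount of irrelevant text."""
    passkey = str(random.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. " * n_filler   # irrelevant padding
    prompt = (
        f"{filler}\n"
        f"The pass key is {passkey}. Remember it.\n"
        f"{filler}\n"
        "What is the pass key?"
    )
    return prompt, passkey

prompt, answer = build_passkey_prompt(n_filler=50_000)
# The model is then asked `prompt` and scored on whether it returns `answer`.
```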
What We Think
The best thing about Infini-attention is the fact that it doesn't require retraining the entire model, as it introduces minimal changes to standard attention (it just adds the compressed memory).
This would explain how Google turned Gemini 1.0 into Gemini 1.5, with a ten-fold larger context window, in a matter of weeks, when these models can take months to train.
The natural next step is to analyze whether this compressed memory is enough for models to scale in length indefinitely, as the fixed-size memory will probably start dropping important details as sequences keep growing.
That being said, it's really fascinating to see the convergence of Transformers with recurrent models, and what seems to be the natural evolution of state-of-the-art AI toward scaling to billions of tokens in memory.
Sponsor of the Week
Learn AI in 5 Minutes a Day
AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.
Our research team spends hundreds of hours a week summarizing the latest news, and finding you the best opportunities to save time and earn more using AI.
Is Grok 1.5 Vision the Solution to Tesla's FSD?
Grok 1.5 Vision, the new super Multimodal Large Language Model, could have an impact well beyond what was expected.
In fact, it could be the sign Tesla has been waiting for to take its Full Self-Driving (FSD) mode to the next level.
But how?
Overhyped but Underdelivered
Look, I own a Tesla Model 3, so I am as biased as they get.
And still, to this day, I haven't touched the FSD mode, because it's as expensive as it is underwhelming.
Nevertheless, while Tesla's FSD mode is still SAE Level 2, other car manufacturers have already released Level 3 vehicles, like Mercedes' EQS sedan and S-Class, for instance.
SAE levels are the six levels of driving automation defined by the Society of Automotive Engineers, from no automation (Level 0) to full automation (Level 5).
Level 2 essentially means that the car is not trusted to monitor the road on its own; the driver must supervise at all times.
Unsurprisingly, FSD, despite the very recent price cut, is as unpopular as ever, in line with overall sentiment toward the industry: only 9% of Americans actually trust these systems, according to AAA.
Adding insult to injury, Tesla's bet on camera-based methods instead of LiDAR is quite controversial, as many consider the latter superior for complex situations.
Tesla's FSD is also object-based (it detects objects, predicts their trajectories, and makes decisions based on those predictions).
Other labs are proposing occupancy-based systems where, instead of tracking object trajectories, the car chooses its own trajectory based on the probability of each position being occupied by other cars, effectively predicting collision probability. There's no consensus on which option is best.
Overall, Tesla's FSD seems to be caught in a perfect storm: overhyped, broadly distrusted, and considered completely unsuited for edge cases or highly complex situations.
So, how could Grok 1.5 Vision change that?
MLLMs as High-Level Planners and Reasoners for Cars
In case you aren't aware, Elon Musk not only leads Tesla but also owns an AI company known as xAI.
A few days ago, they announced Grok 1.5 Vision, an MLLM that matches the performance of the "great three", GPT-4 (OpenAI), Claude 3 (Anthropic), and Gemini (Google), in the most popular benchmarks:
Word of caution: Grok 1.5V's results have yet to be proven "in the wild", so take this with a pinch of salt.
Still, the MLLM looks pretty good if you look at the examples in the link at the beginning of this article.
But what is an MLLM? In simple terms, MLLMs are Large Language Models paired with the capacity to also process other data modalities, like images, audio, or videos.
But what does this have to do with Teslaâs FSD? Well, everything.
One of the main constraints preventing FSD systems from becoming more autonomous is their capacity to handle "edge cases": unexpected, complex situations that the model has rarely (or never) seen in the training data, and which are the perfect recipe for the car to fail.
In practice, camera-based FSD software like Tesla's constantly scrutinizes the data coming from the car's cameras and sensors to predict the current state of the driving environment, with the car's policy deciding which action to take next.
For example, if the car approaches a STOP sign, the model's action should be to bring itself to a halt. But the key differentiator between standard driving and edge cases is the thinking mode they require.
Picture yourself driving. In most cases, you aren't really thinking about the actions you take, like switching gears, changing lanes, or pushing the brake pedal.
These are instinctive actions built on the thousands of hours you've already driven. This is "System 1" thinking, in the late Daniel Kahneman's framework of thinking modes.
But in very weird situations you've never seen before, or in situations where you are lost, you might need to plan your actions.
It still has to be fast, but you are nonetheless deliberating over your actions instead of reacting instinctively. This is what Kahneman described as "System 2".
Therefore, although our current FSD systems seem to work fine in events requiring System 1 decisions, they suffer in System 2 ones. But what if we could use MLLMs to provide this reasoning logic to cars and enhance their decisions?
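To make the idea concrete, below is a purely hypothetical sketch of how such a hybrid loop could be organized. None of these functions or objects exist in Tesla's or xAI's stack; they only illustrate the division of labor between a fast driving policy (System 1) and an MLLM planner (System 2):

```python
def drive_step(camera_frames, fast_policy, mllm_planner, confidence_threshold=0.9):
    """One hypothetical control step of an MLLM-assisted FSD loop (illustrative only)."""
    # System 1: the usual perception + policy stack reacts to the frames instantly.
    action, confidence = fast_policy(camera_frames)
    if confidence >= confidence_threshold:
        return action

    # System 2: for rare, low-confidence edge cases, ask the (hypothetical) MLLM
    # to reason over the same frames and propose a safer high-level plan.
    plan = mllm_planner(
        frames=camera_frames,
        prompt="Describe the scene and propose the safest next maneuver.",
    )
    # The plan would still need to be translated back into low-level controls.
    return plan.to_low_level_action()
```

The latency concern discussed later in this piece applies squarely to the slow branch of a loop like this.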
In fact, we already have a precedent.
The Lingo-1 Model
The UK-based company Wayve is already exploring this idea, with examples like the ones below, where the self-driving car explains the decisions it makes:
Source: Wayve
In fact, Lingo's newest version was just announced yesterday.
Could this synergy pave the way for autonomous cars to perform well even in edge cases that require planning and complex decision-making?
What's more, as Elon Musk himself has admitted, there's a tremendous data upside for Tesla, which already holds an unfathomably large library of real-life video of driving events.
This could propel Tesla's FSD to a completely new level while also catapulting xAI's effort with Grok to an unprecedented scale.
Let's not forget that Elon's goal with xAI is to create AGI, and real-world video is gaining traction as one of the most critical data sources for that milestone, as acknowledged by OpenAI too.
What We Think
Talking cars feel too cyberpunk, but they seem feasible by today's standards. However, several questions remain before powerful MLLM-enhanced FSD software becomes a reality:
Let's not overhype: MLLMs aren't great reasoners today. In fact, "System 2" reasoning and planning remain largely unachieved and are being obsessively studied by every major AI research lab, from OpenAI to Google Deepmind.
Camera & object-based FSD software isn't universally considered the best option. As we saw earlier, many others in the industry are betting strongly on LiDAR.
Latency. In FSD systems, latency is everything, as cars need to make millisecond decisions based on huge quantities of data. And while LPUs like Groq's show great promise, latency isn't exactly Large Language Models' strongest suit today.
That being said, cars that combine their FSD software with MLLMs could see improved reasoning in complex cases, while also gaining an extra layer of explainability: the car itself can "explain" its actions, which could be a fundamental lever for improving overall trust in these systems while also helping engineers debug them.
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]