TheTechOasis
Microsoft just changed the game with LongNet

Ignacio de Gregorio Noblejas
July 09, 2023

According to famous AI researcher and YouTuber David Shapiro, a human reads between 1 and 2 billion tokens in a lifetime.

And now Microsoft’s newest architecture, LongNet, ingests 1 billion tokens, a human’s lifetime read count, in half a second.

In contrast, the AI chatbot with the longest context window today, Claude, ‘only’ reaches 100,000 tokens, or around 75,000 words, roughly a complete Harry Potter book.

That’s still a huge number, but it’s 10,000 times smaller than LongNet’s.

With LongNet, we enter a realm where models could potentially ingest the complete Internet at once, taking us closer to what would be humanity’s greatest achievement, Artificial General Intelligence, or AGI.

But how on Earth has Microsoft built such a crazy model?

Attention is all you need… and hate

At the core of every AI chatbot in existence today, from ChatGPT to last week’s newsletter protagonist Pi, we have the Transformer architecture, probably the greatest discovery in the history of AI.

And at the center of the Transformer we have the attention mechanism, the key to unlocking a machine’s capacity to understand context in text.

In simple terms, attention works by making all words in the given sequence of text ‘talk’ with each other, allowing the model to understand the relationships between them and, thus, understand the meaning of the text.
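To make the ‘talking’ idea concrete, here is a minimal sketch of attention in plain Python. It is a toy under loud assumptions: the function names are mine, the vectors are made up, and queries, keys and values are all the raw input vectors (real models use learned projections and many heads).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy self-attention: each token scores every token, then mixes their vectors.

    Simplification (assumption): queries, keys and values are the raw inputs.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # One score per token in the sequence: this is the 'talking' step.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted blend of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up 2-d 'word' vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one word’s attention weights over all three words, so a sequence of N words produces N × N weights.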

This works wonders but has a problem: it’s computationally very expensive.

Specifically, the attention mechanism used by today’s models has a quadratic relationship between text sequence length and computational cost.

In other words, if we double the length of the input text sequence, the costs to run the chatbot quadruple.
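A quick back-of-the-envelope sketch of that quadratic relationship. It only counts word-to-word comparisons; `attention_pairs` is a name I made up, and real costs also depend on model size and hardware.

```python
def attention_pairs(seq_len):
    """Number of token-to-token comparisons in standard (dense) attention."""
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))

# Doubling the sequence length multiplies the comparisons by four.
ratio = attention_pairs(2_000) / attention_pairs(1_000)
print(ratio)
```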

Consequently, AI companies building LLMs have no option but to limit the input sequence maximum size.

For instance, in ChatGPT’s case it is 32k tokens, or around 24k words.

This is a huge limitation, as humans answer better the more context they’re given.

Machines are no different: naturally, it will be easier for a chatbot to summarize a book if you provide the complete text than if you provide only a chapter.

But more importantly, having a vast input sequence capacity is critical to another feature of LLMs, in-context learning.

The richer the input, the better the outcome

Unless the relevant information is provided in the prompt, chatbots answer questions by trusting the knowledge embedded in their weights during training.

This is not ideal, as the training corpus includes very valuable data, but also false data.

Luckily, one of the best features of LLMs is their few-shot capacity, or as I like to call it, “learning on the go”.

In layman’s terms, you can provide the information you need the chatbot to use in the actual prompt, and the model will successfully use that data to respond.
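As a rough illustration, ‘providing the information in the prompt’ can be as simple as pasting it in front of the question. The `build_prompt` helper and its wording are hypothetical; no real chatbot API is called here.

```python
def build_prompt(context, question):
    """Place reference text in the prompt so the model can answer from it.

    Hypothetical helper: the template wording is an assumption, not a standard.
    """
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

context = "LongNet is a Transformer variant that uses dilated attention."
prompt = build_prompt(context, "What kind of attention does LongNet use?")
print(prompt)
```

The longer the allowed input sequence, the more context fits in that first slot.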

Needless to say, a great number of situations where AI chatbots become particularly useful involve this provision of information to the model in real-time.

Up until LongNet, providing very long texts was not an option for Generative AI, which limited its potential.

But Microsoft decided it was time to end this.

Dilating attention

LongNet is a new Transformer architecture that works with dilated attention.

Instead of a cost that grows quadratically with sequence length, dilated attention allows for linear cost, as you can see in the image below:

Long runtimes not only damage the user experience; they also mean that the model is performing huge amounts of computation to output every single token, forcing labs to limit the sequence size in their models to avoid huge costs.

But with dilated attention, the runtime stays comfortably below 1 second even when the sequence is 1 billion tokens, or approximately 750 million words.

That’s 10,000 Harry Potter books read in 0.5 seconds!
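The arithmetic behind these numbers, as a small sketch. It assumes the rough conversion used in this post (100,000 tokens is about 75,000 words, about one Harry Potter book); the constant and function names are mine.

```python
WORDS_PER_TOKEN = 0.75   # implied by 100,000 tokens being about 75,000 words
WORDS_PER_BOOK = 75_000  # this post's 'one Harry Potter book'

def tokens_to_words(tokens):
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN

def tokens_to_books(tokens):
    """Approximate number of Harry Potter-sized books for a token count."""
    return tokens_to_words(tokens) / WORDS_PER_BOOK

longnet_tokens = 1_000_000_000
claude_tokens = 100_000

print(tokens_to_words(longnet_tokens))  # words in 1 billion tokens
print(tokens_to_books(longnet_tokens))  # books in 1 billion tokens
print(longnet_tokens / claude_tokens)   # how many 100k-token windows fit
```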

But how does this work?

Sparsifying vectors

LongNet isn’t the first attempt at reducing computational complexity.

Sparse Transformers had already been proposed; they limit the number of words each word can talk to in a sequence.

By reducing the number of operations, the costs and latency of the model fall, but this negatively impacts model quality.

But with LongNet, perplexity not only doesn’t increase, it’s lower than in previous models. Perplexity measures how uncertain the model is when predicting every word: the smaller the number, the better the model.
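For the curious, perplexity is commonly computed as the exponential of the average negative log-probability the model gave to the correct tokens. A minimal sketch with made-up probabilities (not numbers from the paper):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Made-up probabilities a model assigned to the correct next word.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.25])
# Guessing uniformly among 4 options gives a perplexity of about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

print(confident, unsure, uniform)
```

One way to read the number: a perplexity of 4 means the model is, on average, as unsure as if it were choosing among 4 equally likely words.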

In other words, fascinatingly, LongNet performs better while being cheaper to run.

And this is thanks to two elements:

Segment length

While every word in standard attention talks to all other words, LongNet defines a segment length, which means that words will only ‘talk’ with words in their own segment.
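A small sketch of what confining attention to segments does to the number of comparisons. The function names and the toy sizes (16 words, segments of 4) are my assumptions, not values from the paper.

```python
def segment_neighbors(position, seq_len, segment_len):
    """Positions a token can attend to when attention stays inside its segment."""
    start = (position // segment_len) * segment_len
    end = min(start + segment_len, seq_len)
    return list(range(start, end))

def segmented_pairs(seq_len, segment_len):
    """Total token-to-token comparisons with segment-only attention."""
    return sum(
        len(segment_neighbors(p, seq_len, segment_len)) for p in range(seq_len)
    )

print(segment_neighbors(5, seq_len=16, segment_len=4))  # word 5 sees words 4 to 7
print(segmented_pairs(16, 4), "vs dense", 16 * 16)
```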

Dilated rate

Unlike Sparse Transformers, which limit the communication between words, LongNet defines patterns of communication.

That is, each word will talk only with the other words dictated by a specific pattern (for instance, every second word in the sequence).

But you may be thinking: ‘OK, but there’s a problem. By limiting word communication, words may miss important context from other segments, right?’

And that’s true, but there’s a fix.
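The pattern idea can be sketched as picking every `rate`-th word inside a segment. This is a toy illustration; the `offset` argument is my own addition to show how a shifted pattern would reach the words the first pattern skips.

```python
def dilated_positions(segment_start, segment_len, rate, offset=0):
    """Positions kept in one segment when only every `rate`-th token is selected."""
    return list(range(segment_start + offset, segment_start + segment_len, rate))

kept = dilated_positions(segment_start=0, segment_len=8, rate=2)
shifted = dilated_positions(segment_start=0, segment_len=8, rate=2, offset=1)

print(kept)     # every second word, starting at word 0
print(shifted)  # the same pattern shifted by one word
print(len(kept) ** 2, "comparisons instead of", 8 ** 2)
```

With a rate of 2, a segment of 8 words needs 16 comparisons instead of 64.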

To avoid such ‘global’ loss of information, which would render the model useless, LongNet varies both variables, so every word also talks with words in other segments and at other dilated rates, following geometric patterns, as seen below:

Every word, represented by a square, will eventually talk to many other words as the variables vary.

In summary, not every word talks to every other word in the sequence, which is utterly unnecessary to model language. Words still end up communicating with many other words, so the model’s understanding of the complete sequence remains strong while the computation requirements fall dramatically.

What’s more, you can assign each segment of the sequence to its own GPU, which allows for even more efficient parallelization, a key way of optimizing GPU usage and, hence, dropping training costs even further.
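Here is a toy sketch of mixing several segment lengths and dilated rates. The `(segment length, rate)` pairs are illustrative values I picked so that both grow geometrically, and the rule that a word keeps words in its segment sharing its index modulo the rate is a simplification of mine, not the paper’s exact scheme.

```python
def attended(position, seq_len, segment_len, rate):
    """Positions reached in one (segment length, dilated rate) configuration.

    Simplification (assumption): a token keeps the tokens in its own segment
    whose index equals its own modulo `rate`.
    """
    start = (position // segment_len) * segment_len
    end = min(start + segment_len, seq_len)
    return {q for q in range(start, end) if q % rate == position % rate}

def mixed_attended(position, seq_len, configs):
    """Union of the positions reached across all configurations."""
    reached = set()
    for segment_len, rate in configs:
        reached |= attended(position, seq_len, segment_len, rate)
    return reached

def pair_count(seq_len, configs):
    """Total comparisons across all configurations and all tokens."""
    return sum(
        len(attended(p, seq_len, w, r)) for w, r in configs for p in range(seq_len)
    )

configs = [(4, 1), (8, 2), (16, 4)]  # illustrative, both values double each step
print(sorted(mixed_attended(5, 64, configs)))
print(pair_count(64, configs), "vs dense", 64 * 64)
print(pair_count(128, configs), "vs dense", 128 * 128)
```

In this toy setup, doubling the sequence from 64 to 128 words doubles the sparse comparisons (768 to 1,536), while the dense count quadruples (4,096 to 16,384).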
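A minimal sketch of the partitioning step only, assuming contiguous and near-equal chunks. `partition_segments` is a hypothetical helper; a real multi-GPU setup would also need the devices to exchange information, which is not shown here.

```python
def partition_segments(seq_len, num_devices):
    """Split token positions into contiguous, near-equal chunks, one per device."""
    base, extra = divmod(seq_len, num_devices)
    chunks, start = [], 0
    for device in range(num_devices):
        # The first `extra` devices take one additional token each.
        size = base + (1 if device < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition_segments(seq_len=10, num_devices=4)
for device, chunk in enumerate(chunks):
    print(f"device {device}: tokens {list(chunk)}")
```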

A necessary step to AGI

To build AGI, humans need to allow machines to have unlimited sequence sizes.

When we humans see a gigapixel image with one billion pixels like the one below (I strongly recommend going here and zooming in, you’re going to be blown away), we don’t need to analyze every pixel, or know what color a car on Fifth Avenue is, to know that’s New York.

Just by seeing the skyline, you know.

Similarly, future image models based on LongNet’s dilated attention will be capable of processing similar-sized images because they won’t pay attention to every single datapoint in the image, something that would be super expensive. Just by roughly analyzing what they see, they will still determine, with little effort, that this is, indeed, The City that Never Sleeps.

With LongNet, a critical breakthrough has been made in helping machines navigate information-heavy environments while paying attention only to what matters, which keeps them sustainable to run.

And that, undoubtedly, takes us closer to AGI.

Link to Microsoft’s paper

Key AI concepts you’ve learned today:

- Attention mechanisms as key to today’s AI

- Dilated attention

- Perplexity, a key measure of model quality