Microsoft just changed the game with LongNet
According to famous AI researcher and YouTuber David Shapiro, a human will read between 1 and 2 billion tokens in their lifetime.
And now Microsoft's newest architecture, LongNet, ingests 1 billion tokens, a human's lifetime read count, in half a second.
In contrast, the most advanced AI chatbot in the world in terms of context length, Claude, "only" reaches 100,000 tokens, or around 75,000 words, a complete Harry Potter book.
That's still a huge number, but it's 10,000 times less than what LongNet can handle.
With LongNet, we enter the realm of models that could potentially ingest the complete Internet at once, taking us closer to what could be humanity's greatest achievement: Artificial General Intelligence, or AGI.
But how on Earth has Microsoft built such a crazy model?
Attention is all you need… and hate
At the core of every AI chatbot in existence today, from ChatGPT to last week's newsletter protagonist Pi, we have the Transformer architecture, probably the greatest discovery in the history of AI.
And at the center of the Transformer we have the attention mechanism, the key to unlocking a machine's capacity to understand context in text.
In simple terms, attention works by making all words in the given sequence of text "talk" with each other, allowing the model to understand the relationships between them and, thus, understand the meaning of the text.
This works wonders but has a problem: it's extremely computationally expensive.
Specifically, the attention mechanism used by today's models has a quadratic relationship between text sequence length and computational cost.
In other words, if we double the length of the input text sequence, the costs to run the chatbot quadruple.
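To make that concrete, here's a rough back-of-the-envelope sketch in Python (the head dimension of 64 and the simple FLOP formula are illustrative assumptions, not any lab's real numbers): standard attention compares every token with every other token, so the work grows with the square of the sequence length.

```python
def full_attention_flops(n: int, d: int = 64) -> int:
    # One standard self-attention head does roughly:
    #   scores = Q @ K^T             -> n * n * d multiply-adds
    #   output = softmax(scores) @ V -> another n * n * d
    # so the cost grows with n squared.
    return 2 * n * n * d

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {full_attention_flops(n):,} FLOPs")
# 1,000 tokens ->   128,000,000 FLOPs
# 2,000 tokens ->   512,000,000 FLOPs  (2x the tokens, 4x the cost)
# 4,000 tokens -> 2,048,000,000 FLOPs
```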
Consequently, AI companies building LLMs have no option but to limit the maximum input sequence size.
For instance, in ChatGPT's case the limit is 32k tokens, or around 26k words.
This is a huge limitation, as humans give better answers the more context they're given.
And machines are no different: naturally, it will be easier for a chatbot to summarize a book if you provide the complete text than if you provide only a chapter.
But more importantly, having a vast input sequence capacity is critical to another feature of LLMs, in-context learning.
The richer the input, the better the outcome
When answering any question they are asked, unless provided in the prompt, chatbots rely on the knowledge embedded in their weights, acquired during training.
This is not ideal, as that training corpus includes very valuable data, but also false data.
Luckily, one of the best features of LLMs is their few-shot capacity, or as I like to call it, "learning on the go".
In layman's terms, you can provide the information you need the chatbot to use in the actual prompt, and the model will successfully use that data to respond.
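As a quick, purely hypothetical illustration (the report and its figures below are invented), "learning on the go" can be as simple as pasting the relevant information into the prompt itself:

```python
# Hypothetical example: the facts the chatbot should use are pasted
# straight into the prompt instead of relying on whatever it memorized
# during training. The report text and numbers are invented.
report = "ACME Corp Q2 report: revenue was $4.2M, up 12% year over year."

prompt = (
    "Using only the report below, answer the question.\n\n"
    f"Report:\n{report}\n\n"
    "Question: How much did ACME's revenue grow in Q2?"
)
# The larger the context window, the more of the real document you can
# paste in verbatim: a chapter, a full book, or, with LongNet, far more.
```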
Needless to say, a great number of situations where AI chatbots become particularly useful involve this provision of information to the model in real-time.
Up until LongNet, providing very long texts was not an option for Generative AI, limiting its potential.
But Microsoft decided it was time to end this.
Dilating attention
LongNet is a new Transformer architecture that works with dilated attention.
Instead of a cost that grows quadratically with sequence length, dilated attention's cost grows only linearly.
Long runtimes not only damage the user experience; they also mean the model is performing a huge amount of computation to output every single token, forcing labs to limit the sequence size of their models to avoid huge costs.
But with dilated attention, the runtime stays comfortably below 1 second even when the sequence is 1 billion tokens, or approximately 750,000,000 words.
That's 10,000 Harry Potter books read in half a second!
But how does this work?
Sparsifying vectors
LongNet isn't our first try at reducing computational complexity.
Recently, sparse transformers were proposed, which limit the number of words one word can talk to in a sequence.
By reducing the number of operations, the costs and latency of the model fall, but this negatively impacts model quality.
But with LongNet, perplexity, a measurement of how certain the model is when predicting each word (the smaller the number, the better the model), not only doesn't increase, it's actually lower than in previous models.
In other words, fascinatingly, LongNet performs better while being cheaper to run.
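If you're curious what that measurement looks like in practice, here's a minimal sketch (the probability values are invented for illustration): perplexity is the exponential of the average negative log-probability the model assigned to each correct next word.

```python
import math

def perplexity(token_probs):
    # Average negative log-probability of the correct next tokens,
    # exponentiated. Lower = the model was less "surprised" by the text.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.9, 0.8, 0.85]))  # ~1.18, a confident model
print(perplexity([0.2, 0.1, 0.25]))  # ~5.85, an uncertain model
```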
And this is thanks to two elements:
Segment length
While in standard attention every word talks to every other word, LongNet defines a segment length, which means that words will only "talk" with words within their own segment.
Dilated rate
Unlike Sparse Transformers, which limit the communication between words, LongNet defines patterns of communication.
That is, each word will only talk with the words selected by a specific pattern (for instance, every second word in its segment).
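As a rough sketch of the idea, not the paper's actual implementation, here's how one combination of segment length w and dilation rate r thins out who each word can attend to:

```python
def dilated_positions(seq_len: int, w: int, r: int):
    # Split the sequence into segments of w tokens, then within each
    # segment keep only every r-th position; tokens attend only to the
    # kept positions inside their own segment.
    kept_per_segment = []
    for start in range(0, seq_len, w):
        segment = list(range(start, min(start + w, seq_len)))
        kept_per_segment.append(segment[::r])
    return kept_per_segment

print(dilated_positions(16, w=8, r=2))
# [[0, 2, 4, 6], [8, 10, 12, 14]]
# Each segment's attention matrix shrinks from w*w to (w/r)*(w/r) entries.
```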
But you may be thinking: "OK, but there's a problem. By limiting word communication, words may be missing important context from other segments, right?"
And that's true, but there's a catch.
To avoid such a "global" loss of information, which would render the model useless, LongNet varies both variables dynamically: every word ends up talking with words in other segments and at other dilated rates, following geometric patterns.
Every word will eventually talk to many other words as these two variables vary.
In summary, although every word won't talk to every other word in the sequence (which is unnecessary to model language anyway), each word still ends up communicating with many others. The model's understanding of the complete sequence therefore remains strong while the computation requirements fall dramatically.
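And here's a toy continuation of the sketch above (again, an illustration of the idea rather than LongNet's real code) showing how mixing several segment-length/dilation-rate pairs, growing geometrically, lets a single token eventually reach both nearby and far-away positions:

```python
def mixed_coverage(seq_len: int, configs):
    # For each (segment length, dilation rate) pair, record which positions
    # each token can attend to, and take the union across configurations.
    reachable = {i: set() for i in range(seq_len)}
    for w, r in configs:
        for start in range(0, seq_len, w):
            segment = list(range(start, min(start + w, seq_len)))
            kept = segment[::r]
            for i in segment:
                reachable[i].update(kept)
    return reachable

# Small, dense segments for local context; large, sparse ones for long range.
configs = [(4, 1), (8, 2), (16, 4)]
coverage = mixed_coverage(16, configs)
print(sorted(coverage[0]))  # [0, 1, 2, 3, 4, 6, 8, 12]
```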
And what's more, each segment of the sequence can be assigned to its own GPU, which allows for even more efficient parallelization, a key way of optimizing GPU usage and, hence, dropping training costs even further.
A necessary step to AGI
To build AGI, we need to allow machines to handle effectively unlimited sequence sizes.
When we humans see a gigapixel image of a city, one with a billion pixels, we don't need to analyze every pixel, or check the color of a car on Fifth Avenue, to know it's New York.
Just by seeing the skyline, you know.
Similarly, future image models based on LongNet's dilated attention will be capable of processing similar-sized images, because they won't pay attention to every single data point in the image, something that would be extremely expensive; by roughly analyzing what they see, they will still determine, with little effort, that this is indeed The City That Never Sleeps.
With LongNet, a critical breakthrough has been made in helping machines navigate information-heavy environments while paying attention only to what matters, keeping them sustainable to run.
And that, undoubtedly, takes us closer to AGI.
Key AI concepts you've learned today:
- Attention mechanisms as key to today's AI
- Dilated attention
- Perplexity, a key measure of model quality
Top AI news for the week
OpenAI releases ChatGPT's code interpreter
Full-body AI scans to prevent cancer
OpenAI is using 20% of its computing power to align AI
New model edits any video into the style you want