Meta might have created a new AI monster

šŸ TheTechOasis šŸ

🤖 This Week's AI Insight 🤖

Nothing in this world is perfect.

ChatGPT, Claude, Bard… all have one thing in common.

They can't frickin' scale.

However, Meta (Facebook) is determined to change that with MegaByte, a new Transformer architecture that could put all the aforementioned models to shame by handling word counts not in the low thousands, like ChatGPT…

But in the millions.

It's All Fun and Games Until You See the Bill

If you're an avid user of Generative AI technology like I am, ChatGPT has become a fundamental part of your tech stack.

I speedrun my study of the dozens of AI research papers produced every week thanks to the AskYourPDF plugin, I quickly generate awesome headlines that convey the message I want, and I plan my travels down to the smallest detail with the Expedia and GetYourGuide plugins.

To me, this is what peak consumer-end technology feels like.

But thereā€™s a problem.

GPT-4 not only sits behind a $20-a-month paywall, but it's also hard-capped at 25 messages every three hours.

This undeniably affects the user experience, and OpenAI knows it, but it is what it is… as OpenAI is spending millions of dollars a day running GPT-4 for us.

This has severely crippled the experience in two ways:

  • Inference (model execution) limits like the one I've just mentioned

  • Severely limited sequence lengths

And while one can live with limits on the number of inferences you run on premium models, the second limitation is a huge problem.

And that's precisely what Meta's MegaByte is solving.

The Great Computation Problem

Transformers are among the most expensive technologies in the world to run.

For reference, LLaMA, Meta's 65-billion-parameter model, is estimated to have cost around $5 million in training alone.

Imagine the costs of training GPT-4, a model that could easily be 10 times larger than that, or more.

And the reason these models are so expensive is their sheer size.

On top of that, self-attention compares every token (a chunk of roughly 3 to 4 characters) in the input sequence with every other token, which means the computational cost of running the model grows quadratically with the length of the sequence.

In other words, the longer the sequence of words you send to the model, the disproportionately more expensive it becomes to run.
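To see why, here's a minimal sketch of self-attention in plain NumPy (illustrative only: the learned query/key/value projections of a real Transformer are omitted). The culprit is the L x L score matrix, which quadruples in size every time the sequence length doubles:

```python
# Minimal self-attention sketch (NumPy). Learned query/key/value
# projections are omitted; the quadratic bottleneck is the L x L
# score matrix built by comparing every token against every other.
import numpy as np

def self_attention(x):
    """x: (L, d) array of token embeddings."""
    L, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (L, L): L^2 pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x                              # (L, d) contextualized output

out = self_attention(np.random.randn(8, 4))  # tiny demo: 8 tokens, width 4

# Doubling the sequence length quadruples the pairwise-score count:
for L in (1_000, 2_000, 4_000):
    print(f"{L:>5} tokens -> {L * L:>12,} pairwise scores")
```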

Naturally, providers have resorted to limiting the amount of text you can send to these models.

Long books, podcasts, or videos are simply out of the question due to their sheer size.

Well… that was before MegaByte.

A Genius "Compromise"

As you may imagine by now, MegaByte was designed with one sole objective: to allow the long-sequence modeling that is prohibitively expensive for models like ChatGPT.

Incredibly, Meta claims it can handle sequences of around 1 million tokens, or roughly 750,000 words (the paper itself talks of million-byte sequences, as MegaByte works directly on raw bytes rather than tokens).

For reference, ChatGPT allows, at the time of writing, approximately 6,000 words… at most.

This means that MegaByte proposes a 125x improvement over ChatGPT's best model.

But how does MegaByte pull it off?

MegaByte comprises three parts (sketched in code after this list):

  1. Patch embedder: Upon receiving the input sequence the user gives to the model, it breaks it into "patches", fixed-size chunks of the original sequence, and embeds them (transforms natural language text into numerical vectors machines can understand)

  2. Global model: All these embedded patches are fed, at the same time, into a model that performs the same process any other Transformer like ChatGPT would: self-attention. This updates the embedding of each patch with information about its position in the sequence and its relationship to other parts of the text.

  3. Local model: For every patch, a much smaller local model decodes (transforms) each patch back into its original text form, learning how to successfully imitate human language.
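Here's a deliberately toy sketch of how those three pieces fit together. Everything in it is an assumption for illustration: the patch size, the dimensions, and the random "embedding table", with plain matrix math standing in for the real trained Transformers:

```python
# Toy sketch of MegaByte's three stages. All sizes are invented and
# simple matrix math stands in for the actual Transformer layers.
import numpy as np

PATCH = 8                                  # bytes per patch (made up)
D = 16                                     # per-byte embedding width (made up)
rng = np.random.default_rng(0)
embed_table = rng.normal(size=(256, D))    # stands in for a learned table

def patch_embedder(byte_seq):
    """1) Embed every byte, then group consecutive bytes into patches."""
    embs = embed_table[byte_seq]           # (L, D)
    return embs.reshape(-1, PATCH * D)     # (L/PATCH, PATCH*D)

def global_model(patches):
    """2) One big model self-attends across patches, not single bytes,
    so its score matrix is (L/PATCH)^2 instead of L^2."""
    scores = patches @ patches.T / np.sqrt(patches.shape[1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ patches                     # contextualized patch vectors

def local_model(patch_context):
    """3) A small model turns one patch back into bytes; each patch
    is independent here, so all calls can run in parallel."""
    logits = patch_context.reshape(PATCH, D) @ embed_table.T  # (PATCH, 256)
    return logits.argmax(-1)               # predicted byte per position

seq = rng.integers(0, 256, size=64)        # 64 raw input bytes
ctx = global_model(patch_embedder(seq))
out = np.stack([local_model(p) for p in ctx])  # parallelizable across patches
print(out.shape)                           # (8 patches, 8 bytes each)
```

In the real architecture both the global and local pieces are autoregressive Transformers, but the shape of the pipeline is the point: embed patches once, attend globally over patches, then let a small model decode each patch independently.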

But how does this design make MegaByte orders of magnitude more scalable?

Greater Parallelism Means Lower Costs

GPUs, the processing units behind most AI today, are capable of processing data in parallel.

Transformers benefit from this because all tokens in the sequence can be processed at once.

However, today's Transformers like ChatGPT generate text sequentially, as each newly generated token is fed back as input for the next one.

And here's where MegaByte revolutionizes AI, as all patches in the sequence are processed and generated by the model at the same time.

In other words, although the patches are presented to the user in logical order, the computations needed to decode every patch happen at the same time, greatly increasing generation speed while reducing costs, since the GPU's parallelization capabilities are maximized.

And as generation is done byte-by-byte inside each patch, the local model can be kept small, which means that the total number of parameters in MegaByte is considerably smaller than in models like GPT-4 or LLaMA, while rivaling them in performance.

Therefore, as computational requirements are drastically reduced, the sequence length can grow to millions of tokens, an unparalleled milestone for Generative AI.
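A back-of-the-envelope calculation makes the savings tangible. (Pure arithmetic, ignoring constants; the 1,024-byte patch size is an assumed value for illustration, not the paper's exact configuration.)

```python
# Rough attention-cost comparison, counting only the number of
# pairwise attention scores computed.
L = 1_000_000      # sequence length in bytes
P = 1_024          # bytes per patch (assumed for illustration)

vanilla = L ** 2                              # every byte vs every byte
megabyte = (L // P) ** 2 + (L // P) * P ** 2
# global model: (L/P)^2 patch-vs-patch scores
# local models: P^2 scores inside each of the L/P patches

print(f"vanilla:  {vanilla:.1e} scores")
print(f"megabyte: {megabyte:.1e} scores")
print(f"roughly {vanilla / megabyte:,.0f}x fewer attention computations")
```

Under these assumptions, MegaByte computes roughly three orders of magnitude fewer attention scores over a million-byte sequence than a vanilla Transformer would.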

It's too early to tell, but if proven right, MegaByte could become the standard Transformer, and Meta could soon become the new sheriff in AI town.

Key AI concepts you've learned by reading this newsletter:

- The greatest problem in GenAI: scalability

- The critical importance of increasing sequence length

- The next step in AI research: parallelizing computation and reducing size

👾 Top AI news for the week 👾

🌹 Amazing de-aged deep fake Harrison Ford in 'Cowboys and Aliens' with Stable Diffusion

🪟 Game-changer Windows Copilot announced by Microsoft

🌋 Fake AI image causes markets to fall $500 billion

😣 OpenAI may have to leave the EU 🇪🇺

šŸ§Ā TikTok is testing Tako, a GenAI-powered chatbot

😯 New research proves some machines learn language as we do

💸 Microsoft's awesome quarterly results have resulted in… pay cuts