Meta might have created a new AI monster

🏝 TheTechOasis 🏝

🤖 This Week’s AI Insight 🤖

Nothing in this world is perfect.

ChatGPT, Claude, Bard… all have one thing in common.

They can’t frickin’ scale.

However, Meta (Facebook) is determined to end this with MegaByte, a new Transformer model that could put all the aforementioned models to shame by being capable of handling word counts not in the low thousands like ChatGPT…

But in the millions.

It’s All Fun and Games Until You See the Bill

If you’re an avid user of Generative AI technology like I am, ChatGPT has become a fundamental part of your tech stack.

Thanks to the AskYourPDF plugin, I speedrun my study of the dozens of AI research papers produced every week; I quickly generate headlines that convey exactly the message I want; and I plan my travels down to the smallest detail with the Expedia and GetYourGuide plugins.

To me, this is what peak consumer-end technology feels like.

But there’s a problem.

GPT-4 not only sits behind a $20-per-month paywall, but it’s also hard-capped at 25 messages every three hours.

This undeniably hurts the user experience, and OpenAI knows it, but it is what it is: OpenAI reportedly spends millions of dollars per day running GPT-4 for us.

This has severely crippled the experience in two ways:

  • Inference (model execution) limits like the one I’ve just mentioned

  • Severely limited sequence lengths

And while one can live with caps on how many inferences you can run on premium models, the second limitation is a huge problem.

And that’s precisely what Meta’s MegaByte is solving.

The Great Computation Problem

Transformers are some of the most expensive technology to run in the world.

For reference, LLaMA, Meta’s 65-billion-parameter model, reportedly cost around $5 million in compute just to train.

Imagine the costs of training GPT-4, a model that could easily be 10 times larger than that, or more.

And the reason these models are so expensive is due to their sheer size.

There’s also the fact that self-attention compares every token (a token is roughly 3 to 4 characters) with every other token in the input sequence, which means the computational cost of attention grows quadratically with the sequence length.

In other words, the longer the sequence of words you send to the model, the disproportionately more expensive it becomes to run: double the length, and the attention cost quadruples.
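You can see the quadratic blow-up with a toy count of token-to-token comparisons. This is an illustrative cost model, not OpenAI’s actual numbers — real FLOP counts also depend on layer count, hidden size, and so on:

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of token-to-token comparisons one self-attention pass performs.

    Every token attends to every token, hence seq_len squared.
    """
    return seq_len * seq_len

for n in [1_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {attention_pair_count(n):>18,} comparisons")
# 100x more tokens means 10,000x more comparisons
```

Going from 1,000 to 100,000 tokens multiplies the attention work by 10,000, which is why long inputs get prohibitively expensive so fast.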

Naturally, providers have resorted to limiting the amount of text you can send to these models.

Long books, podcasts, or videos are simply out of the question due to their sheer size.

Well… that was before MegaByte.

A Genius “Compromise”

As you may imagine by now, MegaByte was designed with one sole objective: long-sequence modeling of the kind that is prohibitively expensive for models like ChatGPT.

Incredibly, Meta claims it can handle sequences of around 1 million bytes — MegaByte works directly on raw bytes rather than tokens — which amounts to hundreds of thousands of words.

For reference, ChatGPT allows, at the time of writing, approximately 6,000 words… at most.

That makes MegaByte’s context window more than 100 times larger than ChatGPT’s best model.

But how does MegaByte pull it off?

MegaByte comprises three different parts:

  1. Patch embedder: Upon receiving the user’s input sequence, it breaks it into “patches” — fixed-size chunks of the original byte sequence — and embeds them (transforms the raw text into numerical vectors machines can understand)

  2. Global model: All the embedded patches are fed, at the same time, into a large model that performs the same process any other Transformer like ChatGPT would: self-attention. This updates each patch’s embedding with information about its position in the sequence and its relationship to other parts of the text.

  3. Local model: For every patch, a much smaller local model predicts the bytes inside that patch one by one, conditioned on the global model’s output — this is where the actual text generation happens.
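The three steps above can be sketched in a few lines of toy Python. Everything here is a deliberately simplified stand-in — the names (`patchify`, `embed_patch`, `global_model`, `local_model`) and the “averaging” context mixer are my own illustrations, not Meta’s code, where each component is a learned neural network:

```python
PATCH_SIZE = 4  # the real model uses larger patches; 4 keeps the demo readable

def patchify(data: bytes, patch_size: int = PATCH_SIZE) -> list[bytes]:
    """Step 1: split the byte sequence into fixed-size patches (padding the tail)."""
    pad = (-len(data)) % patch_size
    data = data + b"\x00" * pad
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

def embed_patch(patch: bytes) -> list[float]:
    """Step 1 (continued): turn a patch into a numeric vector (toy embedding)."""
    return [b / 255.0 for b in patch]

def global_model(patch_embeddings: list[list[float]]) -> list[list[float]]:
    """Step 2: mix information across patches.

    The real model uses self-attention; averaging each patch with its
    predecessors is just a readable stand-in for 'context mixing'.
    """
    out = []
    for i in range(len(patch_embeddings)):
        context = patch_embeddings[: i + 1]
        out.append([sum(col) / len(context) for col in zip(*context)])
    return out

def local_model(context: list[float], patch: bytes) -> bytes:
    """Step 3: decode the bytes of one patch, conditioned on global context.

    The real local model predicts each byte autoregressively; this toy
    simply echoes the patch to show where generation happens.
    """
    return patch

text = b"MegaByte splits input into patches"
patches = patchify(text)
contexts = global_model([embed_patch(p) for p in patches])
decoded = b"".join(local_model(c, p) for c, p in zip(contexts, patches))
```

The key structural point survives even in this toy: the expensive global step only ever sees one vector per patch, not one per byte.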

But how does this design make MegaByte orders of magnitude more scalable?

Greater Parallelism Implies Smaller Costs

GPUs, the processing units behind most AI today, are capable of processing data in parallel.

Transformers benefit from them because all the tokens in the input sequence can be processed at once.

However, current Transformers like ChatGPT still generate text sequentially, as each newly generated token is fed back as input for producing the next one.

And here’s where MegaByte revolutionizes AI, as all patches in the sequence are processed and generated by the model at the same time.

In other words, although the patches are presented to the user in a logical order, the computations needed to decode each patch happen at the same time — severely increasing generation speed while reducing costs, because the GPU’s parallelization capabilities are maximized.

And as generation inside each patch happens byte by byte over a short span, the local model can be much smaller, which means MegaByte’s total parameter count is considerably lower than that of models like GPT-4 or LLaMA, while rivaling them in performance.

Therefore, as computational requirements are drastically reduced, we can considerably increase the sequence length to millions of tokens, an unparalleled milestone for Generative AI.
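Here’s a back-of-the-envelope comparison of why splitting into patches helps. I’m using an assumed cost model that counts only attention “pair comparisons” (real FLOP counts differ): a vanilla Transformer over T bytes pays T², while MegaByte pays (T/P)² in the global model plus P² for each of the T/P patches in the local model:

```python
def vanilla_cost(T: int) -> int:
    """Attention comparisons for a flat Transformer over T positions."""
    return T * T

def megabyte_cost(T: int, P: int) -> int:
    """Global attention over T/P patches, plus local attention of P^2 per patch."""
    patches = T // P
    return patches * patches + patches * P * P  # (T/P)^2 + T*P

T, P = 1_000_000, 1_000  # 1M bytes, patch size 1,000 (illustrative values)
print(f"vanilla : {vanilla_cost(T):,}")      # 1,000,000,000,000
print(f"megabyte: {megabyte_cost(T, P):,}")  # 1,001,000,000
```

Under this sketch, the patch-based split does roughly 1,000x less attention work at a million-byte sequence length — the gap that makes such contexts feasible at all.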

It’s too early to tell, but if proven right, MegaByte could become the standard Transformer, and Meta could soon become the new sheriff in AI town.

Key AI concepts you’ve learned by reading this newsletter:

- The greatest problem in GenAI, scalability

- The critical importance of increasing sequence length

- The next step in AI research: parallelizing computation and reducing size

👾Top AI news for the week👾

🌹 Amazing de-aged deep fake Harrison Ford in ‘Cowboys and Aliens’ with Stable Diffusion

🪟 Game-changer Windows Copilot announced by Microsoft

🌋 Fake AI image causes markets to fall $500 billion

🧐 TikTok is testing Tako, a GenAI-powered chatbot

😯 New research proves some machines learn language as we do

💸 Microsoft’s awesome quarterly results have resulted in… pay cuts