
Mixtral of Experts, the New Open-Source King, and why 2024 will be the year of AI Robotics

šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Mixtral of Experts, Europe's AI Racehorse

  • Leaders: 2024, the Year of Robots

🤯 AI Research of the week 🤯

Just like any other week these days, a new open-source model has come out.

But this time, things are different.

Not only was the model released in a very geeky fashion, through a peer-to-peer torrent network.

The model itself is, well, different.

Emulating one of the core features that turned OpenAI's GPT-4 into the world's most advanced model (with the possible exception of Gemini Ultra), Mistral's new model, Mixtral 8×7B, is the first open-source Sparse Mixture-of-Experts foundation model that is as impressive as it is performant, making it the best open-source model to date.

But it isn't stopping there, as it is up to six times faster than models of its size, making it the best model in the world in terms of performance relative to cost and speed.

Europe seems to have found its AI champion, and today we are going to make sense of this engineering marvel.

The Feedforward Layer Problem

When you are at the frontlines implementing Generative AI, you know how to cut through the bullshit.

If you just listened to what journalists and bloggers alike say, you would think that ChatGPT et al. are already widely used around the globe.

Far from it.

Just run the numbers.

Yes, users can be counted in the millions, but if you look at what corporations are doing, the numbers are, to date, far more modest.

Sure, there's interest, but most Generative AI projects bump into the same obstacle:

Costs.

Costs can be divided into two buckets:

  • Hosting costs: how expensive it is to keep the model loaded on a GPU cluster

  • Inference costs: how much it costs to actually run the model

Regarding the former, if we take the LLaMa 2 70-billion-parameter model, it is stored in float16 precision. In other words, each parameter occupies 16 bits in memory, or 2 bytes.

Thus, if we have 70 billion parameters, that means that the weight file occupies 140 GB of memory.

Taking NVIDIA's newest GPU, the H200, it has a capacity of 141 GB, which means LLaMa 2 barely fits in a single state-of-the-art GPU. However, you also need to account for inference overhead, so you will need at least one more.

Best practices suggest provisioning up to 4 times your nominal capacity, so you would need 4 of them, which for H200s means at least $100K of CAPEX investment, and that's at the low end of the spectrum for just one instance, without counting the actual hosting costs.
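To make that back-of-the-envelope math concrete, here is a minimal sketch in Python. The 70-billion-parameter count, float16 precision, 141 GB H200 capacity, and the 4x headroom rule of thumb are the figures quoted above; everything else is just arithmetic.

```python
import math

# Rough memory math for hosting LLaMa 2 70B in float16 (figures from the text above).
params = 70e9               # 70 billion parameters
bytes_per_param = 2         # float16 = 16 bits = 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")                          # ~140 GB

h200_capacity_gb = 141      # memory of a single NVIDIA H200
headroom_factor = 4         # rule-of-thumb provisioning vs. nominal capacity
gpus_needed = math.ceil(weights_gb * headroom_factor / h200_capacity_gb)
print(f"H200s needed with {headroom_factor}x headroom: {gpus_needed}")  # 4
```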

Regarding inference costs, the issue comes when you realize that in Transformer-based models (basically all NLP foundation models today), for every token prediction, the entire model is queried.

For. Every. Token.

This is why we need to store the entire model in memory in the first place.
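To see why, here is a tiny, framework-agnostic sketch of greedy autoregressive decoding; `model` is a purely illustrative stand-in for the full forward pass, not any specific library's API:

```python
from typing import Callable, List

def generate(model: Callable[[List[int]], List[List[float]]],
             tokens: List[int], n_new: int) -> List[int]:
    """Illustrative decoding loop: one full forward pass per generated token."""
    for _ in range(n_new):
        logits = model(tokens)                  # ALL of the model's weights participate here...
        last = logits[-1]                       # ...just to score the single next token
        next_token = max(range(len(last)), key=last.__getitem__)  # greedy pick
        tokens = tokens + [next_token]          # append it and repeat
    return tokens
```

Every iteration re-runs the whole network, which is why the full weight file has to sit in GPU memory the entire time.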

Of course, you can simply use OpenAI's APIs and forget about all this, but anyone who has run GPT-4 in a production environment knows what that means… 💸💸💸

But where is the root of the problem?

According to Meta's MegaByte paper, in large-scale vanilla Transformer models, up to 98% of FLOPs (the mathematical operations the model performs during the inference cycle) take place in feedforward layers (FFs).

An essential part of such architectures, FFs help extract important features of the data at the expense of requiring heavy computations.

But is there a way of making this process more efficient?

It turns out there is, and that's precisely what Mixtral 8×7B does.

A Sparse Marvel

Mistral's new model has been described by many as a blend of 8 different 7-billion-parameter models.

But, in reality, itā€™s not.

Itā€™s one model, but with a twist on FFs.

If we look at a standard feedforward network, when an input is inserted, all neurons in the network are queried, meaning that a huge amount of computation is required to reach a result.

In a mixture-of-experts layer, only a small subset of the neurons is queried, reducing the number of calculations by a roughly constant factor.

In Mixtral 8×7B's case, every feedforward layer is divided into, you guessed it, 8 parts, or experts.

Hence, for every token, a gating network decides which 2 of the 8 experts will be queried to produce the result.

The gating network is just a small set of neurons with trainable weights (meaning the gate is trained along with the rest of the model) and a softmax function at the output.

This softmax function assigns a percentage to every expert, and the 2 with the highest percentage are chosen to participate in the prediction.

In essence, what we are doing here is training a gate that learns to map each input the FF layer receives to the 2 best experts for that case.

In other words, we are forcing each part of the layer (each group of neurons) to become an expert on specific types of inputs, thus releasing the other experts from having to learn to predict well for every input.
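To make the routing concrete, below is a minimal, illustrative top-2 mixture-of-experts feedforward layer written in PyTorch. This is not Mistral's actual implementation (Mixtral's experts are SwiGLU blocks running on heavily optimized kernels); the class name, dimensions, and GELU experts are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    """Illustrative top-2 gated mixture-of-experts FF layer (not Mistral's code)."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # 8 independent feedforward "experts"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The gating network: a single trainable linear layer followed by a softmax
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)                # one probability per expert
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # keep the 2 highest-scoring experts
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

The key point is the torch.topk call: each token only ever flows through 2 of the 8 expert networks, so the other 6 contribute zero computation for that token.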

The result?

We can increase the size of models greatly, something we want because the more parameters a model has, the more it can learn, while increasing costs by a much smaller factor.

In our case, although Mixtral 8×7B has 46.7 billion parameters, only 12.9 billion of them are active for every token prediction.

Consequently, as cost scales with the amount of computation required, costs drop by roughly a factor of 4, while inference time drops by a factor of 6 (according to Mistral).
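As a quick sanity check on those figures (46.7 billion total and 12.9 billion active parameters are Mistral's published numbers; the ratio is the only thing computed here):

```python
total_params = 46.7e9    # every expert in every layer, plus the shared weights
active_params = 12.9e9   # what actually runs for one token: 2 of 8 experts + shared weights
print(f"Each token touches {total_params / active_params:.1f}x fewer parameters")  # ~3.6x
```

That ~3.6x compute saving is what the "roughly a factor of 4" cost drop refers to; the six-times-faster inference figure is Mistral's measurement against LLaMa 2 70B, a larger dense model.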

But how good is the model really?

The Best Pound-for-Pound

In nominal terms, Mixtral 8×7B is a really good model, surpassing GPT-3.5 and LLaMa 2 70B in almost all metrics.

[Benchmark comparison chart. Source: Mistral]

However, when compared to models like GPT-4 or Gemini Ultra, one could argue that the model is around a year behind the big guys.

But does that really matter?

For you and me and the usual questions we might ask, sure.

But at the enterprise level, everything is relative to cost. And just like Walmart offers great-value products, nothing beats Mixtral 8×7B per dollar invested.

This, added to the fact that Mistral's models will be available in Azure and Google Cloud as well as through their brand-new platform, makes this model a CTO's dream and a staple of what enterprise GenAI will be all about.

On a side note, Mixtral 8×7B raises safety concerns, as the model will respond to almost anything unless you apply a guardrail system prompt like the one Mistral recommends.

Therefore, just like Napoleon seemed invincible during the time stretch between the Battle of Austerlitz in 1805 and the Battle of Wagram in 1809, this French model is going to be tough to beat given the value it gives you for your buck.

🫡 Key contributions 🫡

  • Mixtral 8×7B becomes the best open-source model and, pound for pound, the best overall model, using mixture-of-experts feedforward layers that deliver excellent performance relative to cost and latency

  • It also signals to the world that Europe is catching up on AI, meaning that the US and China aren't the only ones playing the game

🔮 Practical implications 🔮

  • Due to Mixtral's impressive performance per dollar, it becomes one of the first 'great value' AI models, offering the best quality-to-price ratio for corporations looking to deploy GenAI at scale

👾 Best news of the week 👾

šŸ§ Azure AI Studio expands its offering with LLaMa and GPT-4V

šŸ«¢ OpenAI releases best practices for agentic systems

šŸ˜ Samsung unveils its GenAI model Gauss

🥇 Leaders 🥇

2024 Will be the Year of Robots

As much as AI has conquered the digital world in many respects, it is still very much a novice when it comes to the real world.

But I bet that this is changing next year.

Key players in the industry, like Demis Hassabis, Google DeepMind's mastermind, or even Elon Musk through Tesla, are putting their focus on the convergence of robotics and AI.

Thus, today we are going on an evidence-based journey to convince you that embodied intelligence, where AI gains physicality, is much, much closer than many people realize.

We will see how AI has conquered smell, how it is improving its proprioception capabilities at insane speed, and how it even manages to perform some of the most challenging tasks in CGI by itself, without human collaboration, in a self-improvement loop, one of the most incredible yet worrisome new avenues of AI innovation.

Not only that, we will go further and analyze the signs that AI is not only getting closer to us… it is actually going superhuman.

Read at your discretion, as this weekā€™s Leaders issue might give you the willies.
