An Army of ChatGPT Killers & Google's Huge Controversy

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: An Army of ChatGPT Killers

  • Leaders: Understanding Google’s Racist Controversy

🤩 AI Research of the week 🤩

Predibase, an LLM service company, has announced a set of models that might change the course of Generative AI at the enterprise level.

Named LoRA Land, this suite of very small models outperforms ChatGPT on very specific downstream tasks. All of the models can be served from the same GPU, and fine-tuning each one costs about 8 dollars on average, which sounds absolutely outrageous.

A complete revolution in Enterprise Generative AI (which you are about to try) might be upon us, so let’s dive in.

The Hardest Decision

The emergence of foundation models, pre-trained models that can be used for a plethora of different tasks, officially ignited the flame of Enterprise AI.

Before that, deploying AI was a challenge, and a very risky one.

You had to train a specific model for your tasks, and the chances of failure were almost 90%, according to Gartner.

But now AI comes with priors: models that we already know are very good and that we simply need to ground to the task at hand, usually through prompt engineering, which essentially means optimizing the way we interact with the model to maximize performance.

However, at the end of the day, it all boils down to choosing the right LLM for the right task, which is unequivocally one of the hardest decisions in Generative AI today.

But at first, if you’re just starting, the decision might seem quite straightforward.

Proprietary vs open-source

A quick look at the most popular benchmarks and you’ll see that ChatGPT, Gemini, and Claude are the best models.

Also, the companies behind them are heavily capitalized, which lets them offer dirt-cheap prices. However, using these proprietary models comes at a cost, and not an economic one.

You have no control over the model.

This means that it could be updated unexpectedly, forcing you to re-engineer your deployments, and you also need to trust these companies not to misuse the confidential data you send to their models.

And, to be clear, these companies have every incentive in the world to use your data to further refine their models.

On the other hand, open-source models like LLaMA or Mixtral 8x7B, while generally offering lower quality, give you absolute control over the model.

Also, you have guarantees that your confidential data, the most important asset for companies today, is never compromised.

Better still, the quality gap with proprietary models can be closed using fine-tuning.

Data comes first, size second

Even though open-source models indeed lag behind today, with enough fine-tuning on a particular task, you can increase the performance of your open-source model far beyond what ChatGPT can offer.

There are plenty of examples of models ten or even a hundred times smaller than GPT-4 beating it on a specific task after enough fine-tuning.

However, the issues with this are two-fold when using the conventional fine-tuning approach:

  • The fine-tuning trade-off. Fine-tuning on a particular task sacrifices the model's generality for greater performance on that task, because the model forgets part of what it previously knew (often called catastrophic forgetting).

For most enterprise use cases, this is not a problem. To create a customer service assistant you don’t need it to be able to rap about Norwegian salmon.

  • The business case changes. Instead of paying a price per token, as OpenAI offers, you will now be charged by LLM serving companies for dedicated instances, because your fine-tuned model has to run on its own dedicated GPU or cluster, which is far more expensive.

For example, doing this yourself on AWS with on-demand pricing for a g5.2xlarge to serve a custom llama-2-7b model will cost you $1.21 per hour, or roughly $870 to $900 per month to serve 24x7, for every instance of the model, so the bill multiplies with each instance you add.

Therefore, even though fine-tuning is the ideal scenario for most enterprise use cases, which generally require top performance in one task, it has been prohibitively expensive.
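The monthly figure is simple arithmetic on the hourly rate. A minimal sketch, assuming the $1.21 hourly rate quoted above and round-the-clock serving; the function name and the fleet of 10 instances are hypothetical and used only for illustration:

```python
def monthly_cost(hourly_rate: float, instances: int = 1, days: int = 30) -> float:
    """Cost of keeping `instances` dedicated GPU instances running 24x7."""
    return hourly_rate * 24 * days * instances

HOURLY_RATE = 1.21  # on-demand rate quoted in the text

cost_30_days = monthly_cost(HOURLY_RATE)             # one instance, 30-day month
cost_31_days = monthly_cost(HOURLY_RATE, days=31)    # one instance, 31-day month
cost_fleet = monthly_cost(HOURLY_RATE, instances=10) # hypothetical fleet of 10

print(f"1 instance, 30 days:   ${cost_30_days:,.2f}")
print(f"1 instance, 31 days:   ${cost_31_days:,.2f}")
print(f"10 instances, 30 days: ${cost_fleet:,.2f}")
```

The point is the multiplier: with one dedicated instance per fine-tuned model, cost grows linearly with the number of models you serve.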

But with Predibase’s announcement, things have now changed.

An Army of Experts

Predibase seems to have achieved the impossible: offering models superior to the best models out there in a cost-efficient way.

Specifically, they have released 25 fine-tuned versions of Mistral-7B, each of which beats ChatGPT’s most advanced version (GPT-4) on its particular task.

They have built this success along two dimensions:

  • QLoRA fine-tuning

  • LoRAX deployment framework

Quantized low-rank fine-tuning

In QLoRA, two aspects come into play:

  1. Quantization, where we reduce the precision of the parameters stored in memory. Instead of storing each parameter in full precision, like 0.288392384923, you round it and store it as 0.3, accepting a small rounding error in exchange for a much smaller memory footprint.

  2. LoRA fine-tuning, where we adapt a network by training only a small number of weights while the rest stay frozen.

With regard to the latter, the idea is simple:

The conventional approach to fine-tuning is updating all the parameters in a network to train a model on the new set of data, which is very expensive considering we are talking about billions of parameters updated multiple times.

In LoRA, you train only a very small number of weights, exploiting the fact that the weight changes needed for most downstream tasks are intrinsically low-rank.

But what does that mean?

The rank of a matrix is the maximum number of its rows (or, equivalently, columns) that are linearly independent, meaning that none of them can be reconstructed by combining the others.

In simple terms, it measures how much non-redundant information a matrix holds: rows or columns that are linearly dependent on others add no new information.

In other words, to get optimal results we just need to optimize the small meaningful portion of the weights.
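To make the notion of rank concrete, here is a small sketch that computes it with Gaussian elimination. The function name and the example matrix are illustrative, and exact fractions are used so rounding cannot hide a dependent row:

```python
from fractions import Fraction

def matrix_rank(rows: list[list[int]]) -> int:
    """Rank via Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

redundant = [
    [1, 2, 3],
    [2, 4, 6],  # exactly 2 x the first row, so it adds no new information
    [1, 0, 1],
]
print(matrix_rank(redundant))  # 2
```

Three rows, but only two carry independent information, so the rank is 2.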

So what do we do?

Well, instead of updating the model's weight matrix directly, we freeze it and express the change we want to learn as the product of two much smaller, low-rank matrices.

For example, the update to a 5-by-5 weight matrix can be written as the product of a 5-by-2 and a 2-by-5 matrix (with 2 being the rank chosen in this example).

Consequently, we do not train the full-sized matrix but the two low-rank ones, reducing the number of parameters to update from 25 to 20 in this toy case. The saving looks small here, but it grows quickly with size: for a 1,000-by-1,000 matrix at rank 2, it is 4,000 parameters instead of 1,000,000.
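The mechanics can be sketched in a few lines. This assumes the common LoRA formulation, in which the original weights W stay frozen and only two small matrices, B (d-by-r) and A (r-by-k), are trained, so the pair holds r × (d + k) values in total. The layer size and all names below are hypothetical, chosen only for illustration:

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Entries in B (d x r) plus entries in A (r x k)."""
    return r * (d + k)

def matmul(x, y):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def apply_lora(w, b, a):
    """Effective weights: frozen W plus the low-rank update B @ A."""
    delta = matmul(b, a)
    return [[wi + di for wi, di in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]

# Hypothetical layer size, not taken from any specific model.
d = k = 4096
r = 8
full = d * k
low_rank = lora_trainable_params(d, k, r)
print(f"full update: {full:,} values, low-rank update: {low_rank:,} values "
      f"({100 * low_rank / full:.2f}% of the full matrix)")

# Toy 3x3 layer with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen
b = [[1.0], [2.0], [3.0]]                                # 3 x 1, trainable
a = [[0.5, 0.0, -0.5]]                                   # 1 x 3, trainable
print(apply_lora(w, b, a))
```

Because W itself never changes, the same base weights can be reused with any number of different B and A pairs.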

This works because in most cases LLMs have far more parameters than any given task needs, meaning we don’t have to update the whole matrix to obtain the same results.

And this benefit compounds incredibly well: even at a rank of 512 (compared to 2 in the example above), you are still only updating 1.22% of the model's total weights while still getting top improvements.

Adding to this, because the original base model has been quantized, the memory required to hold the base model, in this case Mistral-7B, drops dramatically, amplifying the savings.

However, the biggest problem remains: every fine-tuned model requires its own dedicated GPU.
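To see where the memory saving comes from, here is a simplified sketch. It uses plain round-to-nearest integer quantization with a single scale factor, which is an assumption made for readability: QLoRA itself stores the base weights in a 4-bit NormalFloat format. The parameter count of 7 billion is a round number, and all names are illustrative:

```python
def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    scale = largest / qmax                         # one scale for the whole block
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

def footprint_gb(n_params: float, bits: int) -> float:
    """Memory needed to store n_params values at the given bit width."""
    return n_params * bits / 8 / 1e9

weights = [0.288392384923, -0.71, 0.05, 0.4999]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("stored integers:", q)
print("largest rounding error:", max_error)

N_PARAMS = 7e9  # assumption: roughly 7 billion parameters
print(f"16-bit: {footprint_gb(N_PARAMS, 16):.1f} GB, 4-bit: {footprint_gb(N_PARAMS, 4):.1f} GB")
```

The rounding error per weight is bounded by half the scale factor, while the storage cost falls in proportion to the bit width.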

And this is where LoRAX comes in.

One GPU, 100 LLMs

The idea behind LoRAX is that each adapter only adds its own small set of weights on top of the original model, so the base weights are identical no matter which fine-tuned version is used, as long as all fine-tunings start from the same base model (in this case, Mistral-7B).

Consequently, LoRAX can efficiently manage a set of fine-tuned models that are dynamically loaded onto and evicted from one single GPU, depending on the requests users send:

Source: Predibase

For a more detailed explanation of LoRAX, read here.

In short, as requests arrive, the serving layer determines which adapters (fine-tuned models) each one needs and loads their weights alongside the base model (which remains unchanged, as described earlier).

Source: Predibase
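The load-and-evict behaviour described above can be illustrated with a toy cache. This is not LoRAX's actual API or implementation: it is a minimal sketch that assumes each request names the adapter it needs and that the least recently used adapter is evicted when space runs out. Every class, function, and adapter name here is hypothetical:

```python
from collections import OrderedDict

class AdapterCache:
    """Keeps at most `capacity` adapters resident next to one shared base model."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # hypothetical: fetches adapter weights by id
        self.resident = OrderedDict()  # adapter_id -> weights, least recently used first

    def get(self, adapter_id: str):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)  # mark as most recently used
            return self.resident[adapter_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict the least recently used
        self.resident[adapter_id] = self.load_fn(adapter_id)
        return self.resident[adapter_id]

def serve(request: dict, cache: AdapterCache) -> str:
    """Route one request to the adapter it names; the base model is never reloaded."""
    adapter = cache.get(request["adapter_id"])
    return f"base model + {adapter} answered: {request['prompt']}"

loads = []

def fake_load(adapter_id: str) -> str:
    loads.append(adapter_id)  # record every (expensive) load
    return f"weights[{adapter_id}]"

cache = AdapterCache(capacity=2, load_fn=fake_load)
for adapter_id in ["summarize", "classify", "summarize", "extract", "classify"]:
    print(serve({"adapter_id": adapter_id, "prompt": "..."}, cache))
print("loads:", loads)
```

Only the small adapter weights move in and out; the expensive base model stays in place, which is what makes sharing one GPU across many fine-tuned models plausible.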

And what does all this sum up to?

Well, for a total of roughly 200 dollars (about 8 dollars per fine-tuning), you get a set of 25 models that individually outperform GPT-4 on their given tasks while running on a single A100 GPU, with all the added benefits of having total control over your models.

Open-source’s ‘it moment’?

All in all, Predibase might just have given us a glimpse of the future of Enterprise GenAI. Given the quality/price ratio open-source models are starting to offer, it’s becoming harder and harder to look past them as the primary solution for companies willing to embrace the GenAI revolution.

Try LoRA Land models for free here.

👾 Best news of the week 👾

🙃 The age of sovereign AI by Jensen Huang, CEO of NVIDIA

🥇 Leaders 🥇

Understanding Google’s Huge Racist Controversy

Once again, Google has proven that it is just as capable of delivering generational software like Gemini 1.5 as it is of destroying its reputation by making Gemini’s current version outright racist and immoral.

All in the same week.

Unsurprisingly, people all around the world are saying Gemini is racist and immoral.

However, that’s not accurate.

As a matter of fact, the issue is much, much worse, to the point that proprietary models like Gemini or ChatGPT could end up being totally unusable.

Today, we are getting into the weeds on how a mixture of uncanny strategic decisions, covert racism, and the justification of acts like pedophilia has caused Google to shut down Gemini’s image generation capabilities in one of the biggest flops in AI history.

And more importantly, we will uncover what really caused this.

Subscribe to Leaders to read the rest.
