An Army of ChatGPT Killers & Google's Huge Controversy
TheTechOasis
Breaking down the most advanced AI systems in the world to prepare you for the future.
5-minute weekly reads.
TLDR:
AI Research of the Week: An Army of ChatGPT Killers
Leaders: Understanding Google's Racist Controversy
AI Research of the week
Predibase, an LLM service company, has announced a set of models that might change the course of Generative AI at the enterprise level.
Named LoraLand, this suite of very small models outperforms ChatGPT on very specific downstream tasks. All of the models can be served from the same GPU, and fine-tuning cost an average of 8 dollars per model, which sounds absolutely outrageous.
A complete revolution in Enterprise Generative AI, which you are about to try, might be upon us, so let's dive in.
The Hardest Decision
The emergence of foundation models, pre-trained models that can be used for a plethora of different tasks, officially ignited the flame of Enterprise AI.
Before that, deploying AI was a challenge, and a very risky one.
You had to train a specific model for your tasks, and the chances of failure were almost 90%, according to Gartner.
But now AI comes with priors, models that we already know are very good, and that we simply need to ground to the task at hand, usually using prompt engineering, which essentially means optimizing the way we interact with the model to maximize performance.
However, at the end of the day, it all boils down to choosing the right LLM for the right task, which is unequivocally one of the hardest decisions in Generative AI today.
But at first, if you're just starting, the decision might seem quite straightforward.
Proprietary vs open-source
A quick look at the most popular benchmarks and you'll see that ChatGPT, Gemini, and Claude are the best models.
Also, they are heavily capitalized, offering dirt-cheap prices. However, using these proprietary models comes at a cost, and not an economic one.
You have no control over the model.
This means that it could be updated unexpectedly, forcing you to re-engineer your deployments, and you also need to trust these companies not to misuse the confidential data you send to their models.
And be mindful that these companies have every incentive in the world to use your data to further refine their models.
On the other hand, open-source models like LLaMa or Mixtral 8x7B, while generally offering lower quality, allow you to have absolute control over the model.
Also, you have guarantees that your confidential data, the most important asset for companies today, is never compromised.
Better still, the quality gap with private models can be closed using fine-tuning.
Data comes first, size second
Even though open-source models indeed lag behind today, with enough fine-tuning on a particular task, you can increase the performance of your open-source model far beyond what ChatGPT can offer.
There are plenty of examples of models ten or even a hundred times smaller than GPT-4 beating it on a specific task after enough fine-tuning.
However, the issues with this are two-fold when using the conventional fine-tuning approach:
The fine-tuning trade-off. Fine-tuning on a particular task sacrifices the model's generality for greater performance on that task, because the model forgets part of what it previously knew.
For most enterprise use cases, this is not a problem. To create a customer service assistant you don't need it to be able to rap about Norwegian salmon.
The business case changes. Instead of paying a price per token as OpenAI offers, LLM serving companies will now charge you for dedicated instances, as your fine-tuned model will have to run on its own dedicated GPU or cluster, which is far more expensive.
For example, doing this yourself in AWS with on-demand pricing for a g5.2xlarge to serve a custom llama-2-7b model will cost you $1.21 per hour, or about $900 per month to serve 24x7, for every instance of the model, which of course can scale abruptly.
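To make that arithmetic concrete, here is a minimal sketch of how the bill grows with the number of dedicated instances. The hourly rate is the one quoted above; the 30-day month and the function name are assumptions made for illustration, and real AWS pricing varies by region and over time.

```python
HOURLY_RATE_USD = 1.21   # on-demand g5.2xlarge price quoted in the text
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # assumption: a 30-day month


def monthly_serving_cost(instances: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of keeping `instances` dedicated GPU instances up 24x7 for one month."""
    return instances * hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH


# One dedicated instance per fine-tuned model: the bill scales linearly.
for n in (1, 5, 25):
    print(f"{n:>2} instance(s): ${monthly_serving_cost(n):,.2f} per month")
```

Under these assumptions a single instance comes to roughly $870 a month, in line with the "about $900" figure, and 25 separately hosted fine-tuned models would cost 25 times that.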
Therefore, even though fine-tuning is the ideal scenario for most enterprise use cases, which generally require top performance on one task, it's prohibitively expensive.
But with Predibase's announcement, things have now changed.
An Army of Experts
Predibase seems to have achieved the impossible: offering models superior to the best models out there in a cost-efficient way.
Specifically, they have released 25 fine-tuned versions of Mistral-7B, each of which is superior to ChatGPT's most advanced version on a particular task.
They have built this success along two dimensions:
QLoRA fine-tuning
LoRAX deployment framework
Quantized low-rank fine-tuning
In QLoRA, two aspects come into play:
Quantization, where we reduce the precision of the parameters stored in memory. Instead of storing each parameter at full precision, like 0.288392384923, you round it and store it as 0.3, accepting a rounding error in exchange for a smaller memory footprint.
LoRA fine-tuning, where we optimize a network by fine-tuning only a small portion of the weights.
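The rounding idea behind quantization can be sketched in a few lines. This is a simple "absmax" 8-bit scheme chosen for readability, not the 4-bit NormalFloat format that QLoRA itself uses; the function names and sample weights are made up for illustration.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers using a single shared scale (absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0   # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.288392384923, -0.5, 0.1, 0.0]   # toy values
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.12f} -> {approx:+.12f} (error {abs(original - approx):.2e})")
```

Each weight is now stored as a small integer plus one shared scale, and the price is a rounding error of at most half a quantization step per weight.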
With regard to LoRA, the idea is simple:
The conventional approach to fine-tuning is updating all the parameters in a network to train a model on the new set of data, which is very expensive considering we are talking about billions of parameters updated multiple times.
In LoRA, you only update a very small portion by benefiting from the fact that most downstream tasks an LLM performs are intrinsically low-rank.
But what does that mean?
The rank of a matrix is the largest number of its rows (or, equivalently, columns) that are linearly independent, meaning that none of them can be reconstructed by combining the others.
In simple terms, it measures how much non-redundant information a matrix holds, since rows or columns that are linearly dependent add no new information.
In other words, to get optimal results we just need to optimize the small meaningful portion of the weights.
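If the notion of rank feels abstract, a small sketch can help. The function below counts linearly independent rows with Gaussian elimination on exact fractions; the function name and example matrices are illustrative only, and in practice you would reach for a numerical library instead.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Count linearly independent rows via Gaussian elimination (exact arithmetic)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Zero out this column in every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


redundant = [[1, 2, 3],
             [2, 4, 6],   # exactly twice the first row, so it adds no information
             [1, 0, 1]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(matrix_rank(redundant))
print(matrix_rank(identity))
```

The first matrix has three rows but only two of them carry independent information, which is exactly the kind of redundancy LoRA exploits.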
So what do we do?
Well, instead of updating the model's weight matrix directly, we keep it frozen and express the change to it as the product of two much smaller, low-rank matrices.
As you can see below, a 5-by-5 matrix can be written as the product of a 5-by-2 matrix and a 2-by-5 matrix (with 2 being the rank of this particular matrix).
Source: EntryPoint.ai
Consequently, we do not update the full-sized matrix, but the low-rank ones, reducing the number of parameters to update from 25 to 20 in this particular case (two factors of 10 values each). The saving is modest at this toy size, but it grows quickly as the matrix gets larger relative to the rank.
This works because in most cases LLMs have far more parameters than any given task needs, meaning we don't have to update the whole matrix to obtain the same results.
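Here is a minimal sketch of that bookkeeping in plain Python. The base matrix W stays frozen, and only the two small factors B and A would be trained; the values are toy numbers, the helper names are made up, and the 4096-by-4096 layer at rank 8 is a hypothetical size used only to show how the saving scales.

```python
def lora_params(d_out, d_in, rank):
    """Trainable values in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in


def matmul(a, b):
    """Plain-Python product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


d, r = 5, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weights (toy)
B = [[0.1 * (i + 1), 0.0] for i in range(d)]                        # 5 x 2, trainable
A = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0]]                                     # 2 x 5, trainable

delta = matmul(B, A)                                                # 5 x 5 low-rank update
W_effective = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

print("full update:", d * d, "values")
print("LoRA update:", lora_params(d, d, r), "values")
print("4096 x 4096 at rank 8:", lora_params(4096, 4096, 8), "vs", 4096 * 4096)
```

For the 5-by-5 toy case the two factors hold 5×2 + 2×5 = 20 values against 25 for the full matrix. At the hypothetical 4096-by-4096 size with rank 8 the gap is 65,536 values against roughly 16.8 million, which is where the approach really pays off.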
And this benefit compounds incredibly well as you can see in the image below, where even when you have a rank of 512 (compared to 2 as we saw before) you are still only updating 1.22% of the total weights of the model while still getting top improvements.
Source: EntryPoint.ai
Adding to this, as the original base model has been quantized, the memory requirements for handling the base model, in this case Mistral-7B, drop dramatically, increasing the overall impact.
However, we still face the biggest problem: every fine-tuned model requires its own dedicated GPU.
And this is where LoRAX comes in.
One GPU, 100 LLMs
The idea behind LoRAX is that, as each adapter adds its own set of weights to the original model, the model's base weights are the same regardless of the fine-tuned version used, as long as all fine-tunings are based on the same base model (in this case, Mistral-7B).
Consequently, what LoRAX allows is to efficiently manage a set of fine-tuned models that are loaded and evicted dynamically from one single GPU depending on the types of requests users send:
Source: Predibase
For a more detailed explanation of LoRAX, read here.
In concise terms, depending on the different requests the users send, the model automatically detects what adapters (fine-tuned models) are required in any given case, and loads their weights into the base model (which remains unchanged as we described earlier).
Source: Predibase
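The load-and-evict behaviour can be pictured with a small least-recently-used cache. To be clear, this is a conceptual sketch and not the LoRAX API: the class, the function names, the idea of each request carrying an explicit `adapter_id`, and the capacity of two adapters are all assumptions made for illustration.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: fetches adapter weights from storage
        self._resident = OrderedDict()  # adapter_id -> weights, oldest first
        self.loads = 0

    def get(self, adapter_id):
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)    # mark as recently used
            return self._resident[adapter_id]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)        # evict the least recently used adapter
        weights = self.load_fn(adapter_id)
        self.loads += 1
        self._resident[adapter_id] = weights
        return weights

    def resident(self):
        return list(self._resident)


def fake_load(adapter_id):
    """Stand-in for loading real adapter weights."""
    return {"name": adapter_id}


def serve(cache, base_model, request):
    """The base model is shared; only the small adapter changes per request."""
    adapter = cache.get(request["adapter_id"])
    return f"{base_model}+{adapter['name']}: {request['prompt']}"


cache = AdapterCache(capacity=2, load_fn=fake_load)
for adapter_id in ["summarize", "classify", "summarize", "extract"]:
    print(serve(cache, "base-model", {"adapter_id": adapter_id, "prompt": "..."}))

print("resident adapters:", cache.resident())
print("adapter loads:", cache.loads)
```

The base model is loaded once and never changes, while the much smaller adapters come and go as traffic demands. That is the property that lets many fine-tuned variants share one GPU.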
And what does all this sum up to?
Well, for an approximate total of 200 dollars (8 dollars per fine-tuning on average), you have a set of 25 models that individually outcompete GPT-4 in their given tasks, while running on a single A100 GPU, with all the added benefits of having total control over your models.
Open-source's "it moment"?
All in all, Predibase might just have given us a glimpse of the future of Enterprise GenAI, as it's becoming harder and harder to look away from open-source as the primary solution for companies willing to embrace the GenAI revolution, considering the quality/price ratio these models are starting to offer.
Try LoraLand models for free here.
Best news of the week
The age of sovereign AI by Jensen Huang, CEO of NVIDIA
Leaders
Understanding Google's Huge Racist Controversy
Once again, Google has proven that it is just as capable of delivering generational software like Gemini 1.5 as it is of destroying its reputation by making Gemini's current version outright racist and immoral.
All in the same week.
Unsurprisingly, people all around the world are saying Gemini is racist and immoral.
However, that's not accurate.
As a matter of fact, the issue is much, much worse, to the point that proprietary models like Gemini or ChatGPT could end up being totally unusable.
Today, we are getting into the weeds on how a mixture of questionable strategic decisions, covert racism, and the justification of acts like pedophilia has caused Google to shut down Gemini's image generation capabilities in one of the major flops in AI history.
And more importantly, we will uncover what really caused this.
Subscribe to Leaders to read the rest.