Astonishing breakthrough cuts LLM training costs by 100x!
🏝 TheTechOasis 🏝
Breaking down the most advanced AI systems in the world to prepare you for your future.
5-minute weekly reads.
Before we delve into this breakthrough, let me celebrate something with you guys. 😊
This week, I was mentioned by Adrienne Gibbs, Director of Creator Growth @ Medium, in her piece about the importance of AI in the Medium platform and in our lives, part of the “What We’re Reading” issue Medium’s staff recurrently publishes to their 75 million followers.
I don't mean to boast here; I just want to express how happy I am when people point to me as, if I may, a "reference" in the field of AI, the industry I love and to which I dedicate countless hours of study and research every week of my life.
And the fact she’s doing so is thanks to you guys.
🤯 AI Research of the week 🤯
Switching from "I appreciate you guys" mode to "value" mode, today we're talking about something big.
No super-mega-amazing, world-changing LLM release, no mind-boggling AI-generated images or 3D renders.
There’s plenty of room for that in our newsletter, but today we’re talking about money.
“Ugh, irrelevant and boring”, you may think.
Well, actually quite the contrary, as there’s no industry in the world today more heavily influenced by money than AI.
It drives every decision, both technical and strategic.
And this week a group of researchers have announced an unprecedented feat.
The elephant in the room
That's my bet on what's probably crossing the mind of Sam Altman, the CEO of OpenAI, every waking hour.
In fact, according to Analytics India Magazine, the company could actually go bankrupt by 2024.
I don't believe for a second that will happen, as OpenAI is already "en route" to $1 billion in revenues, but the reported $700,000 per day (previous link) it costs to run the models is a statement of how problematic costs are in the world of AI.
Whether you're training or running inference on a model, whether your bottleneck is computation (training) or memory (inference)… costs are huge.
Taking Microsoft Azure prices as a reference, running an 8-GPU NVIDIA A100 cluster on their cloud costs around $27 per hour. Considering that big models train for months at a time, and on much, much bigger clusters, the costs rise at an unprecedented scale.
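Those numbers are easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes the ~$27/hour 8-GPU A100 price quoted above; the cluster size and training duration are purely illustrative, not any lab's actual setup:

```python
# Back-of-the-envelope monthly training cost.
# Assumptions (illustrative, not a real lab's numbers):
hourly_rate = 27.0   # USD per 8-GPU A100 node per hour (Azure figure above)
nodes = 64           # hypothetical cluster of 64 such nodes (512 GPUs)
days = 30            # one month of continuous training

cost = hourly_rate * nodes * 24 * days
print(f"${cost:,.0f}")  # prints $1,244,160
```

Even this modest hypothetical cluster burns over a million dollars a month, and frontier models use far larger ones for longer.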
For reference, the first family of LLaMA models burned $5 million in just 21 days training its 65-billion-parameter model.
GPT-4 is reportedly a trillion-parameter-sized model, so Sam Altman wasn’t really exaggerating when he estimated the total cost at around $100 million.
Yes, you read that right. And he actually said it was “more than that”.
And as size has proven to be strongly correlated with quality (with no signs of saturation yet), the temptation to go big is natural, so the world is off to the races to burn. more. money.
But do we need to? Well, these researchers say no.
The problems of today’s AI
There are several problems in today's AI training methods that these researchers have tackled.
The first of those is costs. The cost of training or running a model is prohibitive for everyone besides a handful of companies and governments.
The second topic is extrapolation. Current top LLMs’ performance falls off a cliff whenever you send them text sequences longer than what they’ve seen in their training. To prevent this, companies like OpenAI simply don’t allow you to do so.
The third topic is LLM evaluation. Researchers conclude that current evaluation procedures are not representative at all of what building “intelligence” is, as we evaluate the “intelligence” of a model based on what it knows and how good it is at performing the task it was designed for.
Therefore, the objectives of the researchers were pretty straightforward:
For point 1, prove that we can build huge models at costs orders of magnitude below the current standard.
For point 2, assemble the first model with performance not affected by longer sequence modeling.
For point 3, the hardest of all, propose a new evaluation method that really evaluates the “intelligence” of the model, potentially instilling a new standard for LLM evaluation.
Consequently, we're talking about a very ambitious research paper, one that, were it to deliver on its objectives, would become a canonical piece in AI research.
So what did researchers do?
Building for AGI requires better economic incentives
As we discussed last week, open-source’s value proposition is very interesting, but hardly attainable today.
The fact that Meta grants an open license and access to its models doesn't change the fact that you're going to need millions of dollars and a huge GPU cluster simply to store the weights and, what's worse, run them at scale.
Here, the researchers propose an aggressive growth strategy.
In layman’s terms, instead of training three models at different sizes from scratch, which is the way companies like Meta or OpenAI work (training every model from zero), the paper presents FLM-101B, a 101-billion-parameter model created using a growth strategy that:
first builds a 16-billion model,
then a 51B model extended from the initial 16B one,
and finally, the 101B one based on the previous two.
By doing so, they dramatically reduce the time required to build a performant, huge model from 48 GPU days to 22 days.
But how much cost savings is that?
Well, for starters, besides the fact that you reduce net GPU training days, you also reduce the need for training data for every extension.
For instance, a similar-sized model trained from scratch might use, let's say, a dataset of 300 billion tokens.
With FLM, they were in the same range but following a scaled approach:
As you can see in the image above, most of the training data is used with a considerably smaller model (the 16B model), and the architecture is extended to bigger sizes using much less data.
Put simply, the amount of time your model is being trained with 300 billion tokens (around 250 billion words) is considerably smaller, thus the number of floating point operations (FLOPs, the standard way of measuring GPU usage and cost) is considerably smaller.
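We can make this concrete with the common approximation that training compute is C ≈ 6 × N × D FLOPs (N = parameters, D = tokens). The per-stage token split below is illustrative, not the paper's exact numbers, and note that raw FLOP savings under this toy split are only a few-x: the reported 100x figure compares total reported budgets ($10M vs. $100K), not FLOPs alone.

```python
# Rough compute comparison using the common C ~ 6 * N * D approximation.
def flops(params_b: float, tokens_b: float) -> float:
    """Approximate training FLOPs for params_b billion parameters
    trained on tokens_b billion tokens."""
    return 6 * params_b * 1e9 * tokens_b * 1e9

# Training the 101B model from scratch on all 300B tokens:
from_scratch = flops(101, 300)

# Staged growth: most tokens at the smallest size (splits are illustrative).
staged = flops(16, 200) + flops(51, 60) + flops(101, 40)

print(f"from scratch: {from_scratch:.2e} FLOPs")
print(f"staged:       {staged:.2e} FLOPs")
print(f"ratio:        {from_scratch / staged:.2f}x")  # ~2.94x under this split
```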
But why does this work?
Well, because it was recently proven that function-preserving model growth is possible.
In layman’s terms, this means that the knowledge and capabilities that the smaller, 16B model learns, are transferred to larger ones in the same training process, meaning that the newer weights added when extending the model size depart from a solid knowledge base.
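To give an intuition for what "function-preserving" means, here's a minimal NumPy sketch of Net2Net-style width growth, the classic version of this idea (FLM's actual growth operator differs in detail): new hidden units are copies of existing ones, with outgoing weights split so the network computes exactly the same function before further training.

```python
import numpy as np

rng = np.random.default_rng(0)

def widen(W1, W2, new_width):
    """Grow a 2-layer MLP's hidden width to new_width without
    changing the function it computes (Net2Net-style)."""
    old_width = W1.shape[1]
    # Existing units keep their slot; new slots copy random old units.
    mapping = np.concatenate([
        np.arange(old_width),
        rng.integers(0, old_width, new_width - old_width),
    ])
    W1_new = W1[:, mapping]  # duplicate incoming weights
    # Split each replicated unit's outgoing weights by its copy count
    # so the summed output stays identical.
    counts = np.bincount(mapping, minlength=old_width)
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, W2_new

x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 5))
y_small = np.maximum(x @ W1, 0) @ W2      # ReLU MLP, hidden width 16
W1b, W2b = widen(W1, W2, 32)
y_big = np.maximum(x @ W1b, 0) @ W2b      # hidden width 32, same outputs
print(np.allclose(y_small, y_big))        # True
```

Because the grown model starts from the exact function the small model learned, none of the earlier training is wasted.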
While training a 101B model from scratch would cost around $10 million, FLM-101B cost $100,000. Yes, that's a 100x reduction.
Suddenly, building huge models is economically viable.
By the way, in case you’re wondering how we know that the newer model is effectively absorbing the previous model’s knowledge, pay attention to the curve steepness in the previous graph.
The fact that the model’s loss falls faster with every size extension but with less data means two things:
Bigger models learn better (we already knew that)
This bigger model is effectively using the previous knowledge to, indeed, learn more complex representations over its training data.
But researchers also focus their efforts on extrapolation.
First-ever use of xPos
Regarding extrapolation, researchers used xPos positional embeddings.
Positional embeddings are critical elements of transformers like ChatGPT, as these models ingest all the sequence at once.
As the order of words in a sentence obviously influences the correctness and meaning of a sentence, we add positional information to the tokens before they are inserted into the model.
xPos builds on RoPE (rotary positional embeddings), a relatively new technique that encodes a token's position in the sequence by rotating its vector in polar form. xPos is a proven method to ensure that positional embeddings don't degrade model performance in extrapolation scenarios (when the user sends the model a longer sequence of text than it was trained on).
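Here's a minimal sketch of the rotary mechanism that xPos builds on: each pair of dimensions in the token vector is rotated by an angle proportional to the token's position, so the dot product between two rotated vectors depends only on their relative distance. (xPos additionally applies an exponential decay factor, omitted here.)

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a token vector.
    x: (dim,) vector with even dim; pos: integer position."""
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Relative-position property: the score only depends on the distance (2),
# not on the absolute positions.
print(np.allclose(rope(q, 3) @ rope(k, 5), rope(q, 1) @ rope(k, 3)))  # True
```

That relative-position property is exactly why rotary schemes are attractive for extrapolation: attention scores are a function of distances, not absolute indices the model may never have seen.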
Extrapolation is a key element in humanity’s path to AGI, because just like humans are capable of extrapolation, so should machines.
But, probably, the most interesting contribution of this paper is their new proposed methodology for LLM evaluation to prove machine intelligence.
Knowledge isn’t proof of intelligence
One of the most critical questions in AI today is how we humans can tell whether a model is reasoning out an answer or regurgitating it.
In other words, has the model memorized that answer, or is it actively reasoning it?
Naturally, if we evaluate a model based on what it knows, there’s no way of knowing if the model is memorizing or not. Thus, these researchers propose a new way of evaluating the “intelligence” of a model: evaluating it against questions that we know it doesn’t know.
Hence, the road to intelligence is not building the model that “knows it all” but the model that learns to generalize to any distribution shift.
In other words, by evaluating the model in tasks that require reasoning processes not possibly seen in the training phase, you’re forcing the model to actively reason its answer.
For instance, they propose several new evaluation methods, such as "anti-interference": an evaluation that forces the model to perform well in very noisy scenarios (with a lot of irrelevant data), making it really understand the task at hand and which data is actually relevant.
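To make the anti-interference idea concrete, here's a toy sketch: surround a simple task with irrelevant distractor sentences and check whether a model still solves it. The distractor sentences are invented for illustration, and the model call itself is left out (it would depend on whichever LLM you're evaluating):

```python
import random

# Build a "noisy" prompt: a simple question buried in irrelevant facts.
question = "What is 17 + 25?"
distractors = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
    "Photosynthesis occurs in chloroplasts.",
]
random.seed(0)
random.shuffle(distractors)

noisy_prompt = " ".join(distractors) + " Now answer: " + question
print(noisy_prompt)
# A model that truly understands the task answers 42 despite the noise;
# a brittle one latches onto the irrelevant facts.
```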
A breath of fresh air for the industry
Don’t get me wrong, not many of you are going to use FLM-101B in your life, because it was not created for that purpose.
These researchers had another idea in mind: to elevate this industry to new heights by proposing a training method that is, finally, economically viable, and to bring into the limelight several intuitions about LLM intelligence evaluation that will surely be taken into account in future work.
Finally, the fact that this paper comes from China is really positive, showcasing that the country is opening up by spreading AI knowledge to the world.
Because this industry should build bridges, not walls.
To recap, the paper delivers:
The first-ever proof that we can build huge models with very low investments
The first successful implementation of xPos positional embeddings, an innovation that could reinforce model extrapolation, a key step for AGI
A new methodology to evaluate true intelligence in machines, shifting from a knowledge-based method to a generalization-based one (if you can't have known something by heart, you're forced to reason it out).
🔮 Practical implications 🔮
Instead of building billion-dollar-in-cost models that know everything there is to know (no user or company actually needs that), we should build smaller models with proven reasoning capabilities (high IQ) to then tailor to specific domains. Models with less general knowledge, but higher IQ and domain specificity
This could be a deviation from the path led by ChatGPT, a step in the direction of domain-specific models, which could seriously affect future valuations of big players
👾 Best news of the week 👾
🧐 Adept releases Persimmon, the best under-10-billion-parameter open-source model
🤗 HuggingFace releases LLM training service for companies
🧑🏽‍🏫 OpenAI releases ChatGPT guide for teachers
😍 While the best image generation model, MidJourney, sucks at typography, IdeoGram is amazing. Try it!
🥇 Leaders 🥇
This week’s issue: My ‘holy grails’ of learning, the ultimate bible with all you need to stay ahead of the curve in AI.
As you know, I am now fully committed to my Leaders subscription, part of this newsletter.
If you're a regular reader of my Medium content, you'll have noticed that my output on that platform has decreased in favor of more content in my newsletter.
Posting less content on Medium means I’m losing a lot of revenue, but it’s a qualitative investment for me.
I’m not leaving Medium anytime soon, don’t worry, but I want to be less focused on going super viral, and more focused on delivering value.
Therefore, in today’s issue, I’m going to distill to you all the main sources of great information I’ve gathered over time.
The best courses in the world (ranked by complexity level) to speed up your training
Hidden gems that few know of with the key timeless concepts that remain no matter how much AI evolves
Key pieces to make you reflect on the critical times we’re living in by some of the most influential people in the world
How to apply AI
And for those wanting to invest in AI, the point of view of those pouring the money in, as the finance folks have brilliant insights that will shape how you invest or build the companies of the future around AI.
Andrej Karpathy, the Slovakian genius
We’re starting strong guys, with what may be one of the brightest minds of our time, Andrej Karpathy.
An Eastern European prodigy and founding member of OpenAI back in 2015, he was poached by Elon Musk to lead Tesla's autonomous driving technology for many years. Today, having also taught at Stanford, he has returned to OpenAI.
Andrej is also a rare sight, because besides being a literal genius, he’s also great at teaching.
And when I say great, I MEAN GREAT.
This is weird because geniuses tend to be the worst teachers in the world, as their brilliance often fails miserably to translate for the common folk.
Below you'll find a distilled list of the best of the best in his repertoire:
The State of GPT: A comprehensive, 40-minute keynote by Andrej at the Microsoft Build 2023 conference. Key concepts:
You’ll learn how Conversational AI is built
Tips and tricks to better use these models
Key intuitions regarding the actual understanding of what ‘ChatGPT really is’
Intro to Transformers: An hour-long video of his class at Stanford regarding Transformers.
Zero to Hero: His ultra-deep, 10+ hour, zero-to-a-hundred crash course on deep learning. It includes all the intuitions needed to understand how LLMs are really built. It doesn't get any better than this.
Andrew Ng, the Godfather of Deep Learning
Following up with another group of amazing courses, we could not forget another of the larger-than-life Deep Learning godfathers, Andrew Ng.
Besides his undeniable contributions to fostering Deep Learning adoption among big tech companies, Andrew Ng's commitment to educating about AI is commendable and incredibly valuable for you.
Subscribe to Leaders to read the rest.
Become a paying subscriber of Leaders to get access to this post and other subscriber-only content.
Already a paying subscriber? Sign In