Astonishing breakthrough cuts LLM training costs by 100x!

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

Before we delve into this breakthrough, let me celebrate something with you guys, 😊.

This week, I was mentioned by Adrienne Gibbs, Director of Creator Growth @ Medium, in her piece about the importance of AI in the Medium platform and in our lives, part of the “What We’re Reading” issue Medium’s staff recurrently publishes to their 75 million followers.

I don’t mean to boast here; I simply wanted to express how happy I am when people point to me as a, if I may, “reference” in the field of AI, the industry I love and to which I dedicate countless hours of study and research every week of my life.

And the fact she’s doing so is thanks to you guys.

🤯 AI Research of the week 🤯

Switching from “I appreciate you guys” mode to “value” mode, today we’re talking about something big.

No super-mega-amazing-world-changing released LLM, no mind-boggling AI-generated images or 3D renders.

There’s plenty of room for that in our newsletter, but today we’re talking about money.

“Ugh, irrelevant and boring”, you may think.

Well, actually quite the contrary, as there’s no industry in the world today more heavily influenced by money than AI.

It drives every decision, both technical and strategic.

And this week a group of researchers has announced an unprecedented feat.

The elephant in the room

“Costs, man.”

That’s my bet on what’s probably crossing the mind of Sam Altman, the CEO of OpenAI, every waking hour.

In fact, according to Analytics India Magazine, the company could actually go bankrupt by 2024.

I don’t believe for a second that will happen, as OpenAI is already “en route” to $1 billion in revenue, but the reported $700,000 per day it costs to run the models (per the same report) shows how problematic costs are in the world of AI.

Whether you’re training a model or running inference on it, and whether your bottleneck is computation (training) or memory (inference)… costs are huge.

Taking Microsoft Azure prices as a reference, running an 8-GPU NVIDIA A100 cluster on their cloud costs around $27 per hour. Considering that big models sometimes train for months, and on much, much bigger clusters, the costs quickly climb into the millions.
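To get a feel for how quickly that hourly price adds up, here is a minimal back-of-the-envelope sketch. Only the $27 per hour figure comes from the text above; the node counts and the 21-day duration are hypothetical values picked for illustration, and real bills also depend on discounts, storage and networking.

```python
# Back-of-the-envelope rental cost for a training run.
# The $27/hour price for one 8-GPU A100 node is the figure quoted in the text;
# the node counts and durations used below are hypothetical.

HOURLY_PRICE_PER_NODE = 27.0  # USD per hour for one 8-GPU A100 node


def training_cost(nodes: int, days: float,
                  hourly_price: float = HOURLY_PRICE_PER_NODE) -> float:
    """Return the rental cost in USD of `nodes` nodes running for `days` days."""
    return nodes * days * 24 * hourly_price


if __name__ == "__main__":
    for nodes, days in [(1, 21), (64, 21), (256, 21)]:
        cost = training_cost(nodes, days)
        print(f"{nodes:>4} nodes x {days} days -> ${cost:,.0f}")
```

Even with these made-up cluster sizes, a three-week run moves from thousands to millions of dollars simply by adding nodes.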

For reference, training the 65-billion-parameter model of the first LLaMa family burned $5 million in just 21 days.

GPT-4 is reportedly a trillion-parameter-sized model, so Sam Altman wasn’t really exaggerating when he estimated the total cost at around $100 million.

Yes, you read that right. And he actually said it was “more than that”.

And as size has so far proven to be strongly correlated with quality (with no signs of saturation), the temptation to get big is natural, so the world is off to the races to burn. more. money.

But do we need to? Well, these researchers say no.

The problems of today’s AI

These researchers set out to address several problems in today’s AI training methods.

  1. The first of those is costs. Costs of training or running a model are prohibitive for anyone in the world besides a few companies and governments.

  2. The second topic is extrapolation. Current top LLMs’ performance falls off a cliff whenever you send them text sequences longer than what they’ve seen in their training. To prevent this, companies like OpenAI simply don’t allow you to do so.

  3. The third topic is LLM evaluation. Researchers conclude that current evaluation procedures are not representative at all of what building “intelligence” is, as we evaluate the “intelligence” of a model based on what it knows and how good it is at performing the task it was designed for.

Therefore, the objectives of the researchers were pretty straightforward:

  • For point 1, prove that we can build huge models at costs orders of magnitude below the current standard.

  • For point 2, assemble the first model with performance not affected by longer sequence modeling.

  • For point 3, the hardest of all, propose a new evaluation method that really evaluates the “intelligence” of the model, potentially instilling a new standard for LLM evaluation.

Consequently, we’re talking about a very ambitious research paper, one that would become a canonical piece of AI research were it to deliver on its objectives.

So what did researchers do?

Building for AGI requires better economic incentives

As we discussed last week, open-source’s value proposition is very interesting, but hardly attainable today.

The fact that Meta grants an open license and access to its models doesn’t negate the fact that you’re going to need a huge GPU cluster, and millions of dollars, simply to store the weights and, what’s worse, run them at scale.

Here, the researchers propose an aggressive growth strategy.

In layman’s terms, instead of training three models at different sizes from scratch, which is the way companies like Meta or OpenAI work (training every model from zero), the paper presents FLM-101B, a 101-billion-parameter model created using a growth strategy that:

  • first builds a 16-billion-parameter model,

  • then a 51B model extended from the initial 16B one,

  • and finally, the 101B one based on the previous two.

By doing so, they dramatically reduce the time required to build a performant, huge model from 48 days to 22 days.

But how much cost savings is that?

Well, for starters, besides the fact that you reduce net GPU training days, you also reduce the need for training data for every extension.

For instance, a similar-sized model trained from scratch will run at full size over, let’s say, a dataset of 300 billion tokens.

With FLM, they were in the same range but following a scaled approach:

As you can see in the image above, most of the training data is used with a considerably smaller model (the 16B model), and the architecture is extended to bigger sizes using much less data.

Put simply, the full-size model spends far less time training on the 300 billion tokens (around 225 billion words, at the usual rule of thumb of roughly 0.75 words per token), so the number of floating point operations (FLOPs, the standard way of measuring compute and, therefore, cost) is considerably smaller.
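To make the FLOPs argument concrete, here is a rough sketch that uses the common rule of thumb that training compute is about 6 × parameters × tokens. That rule is an approximation I’m bringing in, not something from the paper, and the per-stage token counts are hypothetical numbers chosen only so that they add up to 300 billion; the real schedule may look different.

```python
# Rule of thumb (an assumption, not taken from the paper):
#   training FLOPs ~= 6 * parameters * tokens
# The per-stage token counts are hypothetical; they only sum to 300B.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense model."""
    return 6.0 * params * tokens


# Baseline: a 101B model sees all 300B tokens at full size.
SCRATCH = train_flops(101e9, 300e9)

# Growth strategy: most tokens are spent while the model is still small.
GROWTH_SCHEDULE = [
    (16e9, 250e9),   # 16B stage
    (51e9, 35e9),    # 51B stage
    (101e9, 15e9),   # 101B stage
]
GROWN = sum(train_flops(params, tokens) for params, tokens in GROWTH_SCHEDULE)

if __name__ == "__main__":
    print(f"from scratch: {SCRATCH:.3e} FLOPs")
    print(f"grown:        {GROWN:.3e} FLOPs")
    print(f"grown / scratch = {GROWN / SCRATCH:.2f}")
```

The point of the sketch is the shape of the saving, not the exact figure: the more tokens are consumed at the small stage, the less compute the full-size model needs.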

But why does this work?

Well, because of a property called function preservation: a model can be enlarged in such a way that, right after the expansion, it still computes the same outputs as the smaller one.

In layman’s terms, this means that the knowledge and capabilities the smaller 16B model learns are carried over to the larger ones within the same training process, so the new weights added when extending the model start from a solid knowledge base.
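A toy sketch can show what “function-preserving” means in practice. The example below widens a tiny two-layer network by adding hidden units whose outgoing weights start at zero, so the output is unchanged at the moment of growth. This is a simplified illustration of the general idea, not the growth operator used for FLM-101B, and every name in it is hypothetical.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(w1, b1, w2, b2, x):
    """Two-layer MLP: y = W2 * relu(W1 * x + b1) + b2."""
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


def widen(w1, b1, w2, extra, rng):
    """Add `extra` hidden units without changing the network's output.

    New units get random incoming weights but zero outgoing weights,
    so they contribute nothing until training updates them.
    """
    n_in = len(w1[0])
    new_rows = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(extra)]
    new_w1 = [row[:] for row in w1] + new_rows
    new_b1 = b1[:] + [rng.uniform(-1, 1) for _ in range(extra)]
    new_w2 = [row[:] + [0.0] * extra for row in w2]
    return new_w1, new_b1, new_w2


rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 2
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

small_out = forward(w1, b1, w2, b2, x)
big_w1, big_b1, big_w2 = widen(w1, b1, w2, extra=5, rng=rng)
big_out = forward(big_w1, big_b1, big_w2, b2, x)

if __name__ == "__main__":
    print("small model output:", small_out)
    print("grown model output:", big_out)
```

The grown network has more capacity but starts exactly where the small one left off, which is why training doesn’t have to begin again from zero.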

The result?

While training a 101B model from scratch would cost around 10 million dollars, FLM-101B cost $100,000. Yes, that’s a 100x reduction.

Suddenly, building huge models is economically viable.

By the way, in case you’re wondering how we know that the newer model is effectively absorbing the previous model’s knowledge, pay attention to the curve steepness in the previous graph.

The fact that the model’s loss falls faster with every size extension but with less data means two things:

  1. Bigger models learn better (we already knew that)

  2. This bigger model is effectively using the previous knowledge to, indeed, learn more complex representations over its training data.

But researchers also focus their efforts on extrapolation.

First-ever use of xPos

Regarding extrapolation, researchers used xPos positional embeddings.

Positional embeddings are critical elements of Transformer-based models like ChatGPT, as these models ingest the whole sequence at once.

As the order of words in a sentence obviously influences the correctness and meaning of a sentence, we add positional information to the tokens before they are inserted into the model.

xPos builds on RoPE (rotary positional embeddings), a relatively new technique that encodes a token’s position in the sequence by rotating its vector by an angle proportional to that position. xPos is designed to ensure that positional embeddings do not hurt model performance in extrapolation scenarios (scenarios where the user sends the model a longer sequence of text than it saw in training).
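The rotation idea is easier to see in code. Below is a minimal sketch of rotary embeddings plus an xPos-style decay. Treat it as an illustration under assumptions: the `zeta` factor is a single scalar here, whereas xPos uses per-dimension scaling, and the vectors and positions are made up.

```python
import math


def rotate(vec, position, base=10000.0):
    """Rotary encoding: rotate each (even, odd) pair of `vec` by an
    angle proportional to `position`, with a different frequency per pair."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def attention_score(q, k, q_pos, k_pos, zeta=1.0):
    """Dot product of a position-encoded query and key.

    zeta=1.0 gives plain RoPE. With zeta < 1 the query is scaled by
    zeta**q_pos and the key by zeta**(-k_pos), so the score is scaled by
    zeta**(q_pos - k_pos): a simplified, scalar version of the xPos decay.
    """
    q_enc = [zeta ** q_pos * v for v in rotate(q, q_pos)]
    k_enc = [zeta ** (-k_pos) * v for v in rotate(k, k_pos)]
    return sum(a * b for a, b in zip(q_enc, k_enc))


# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.4, 0.1]

if __name__ == "__main__":
    # Same distance (2 tokens apart) at different absolute positions.
    print(attention_score(q, k, 5, 3), attention_score(q, k, 12, 10))
    print(attention_score(q, k, 5, 3, zeta=0.98),
          attention_score(q, k, 12, 10, zeta=0.98))
```

In both cases the score depends only on how far apart the two tokens are, not on where they sit in the sequence, which is the property that makes this family of encodings attractive for longer inputs.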

Extrapolation is a key element in humanity’s path to AGI, because just like humans are capable of extrapolation, so should machines.

But, probably, the most interesting contribution of this paper is their new proposed methodology for LLM evaluation to prove machine intelligence.

Knowledge isn’t proof of intelligence

One of the most critical questions in AI today is how we humans can tell whether a model is reasoning its way to an answer or regurgitating it.

In other words, has the model memorized that answer, or is it actively reasoning it?

Naturally, if we evaluate a model based on what it knows, there’s no way of knowing if the model is memorizing or not. Thus, these researchers propose a new way of evaluating the “intelligence” of a model: evaluating it against questions that we know it doesn’t know.

Hence, the road to intelligence is not building the model that “knows it all” but the model that learns to generalize to any distribution shift.

In other words, by evaluating the model in tasks that require reasoning processes not possibly seen in the training phase, you’re forcing the model to actively reason its answer.

For instance, they propose several new ways of evaluating such as “anti-interference”.

This is an evaluation method that forces the model to perform well in very noisy scenarios (with a lot of irrelevant data) forcing the model to really understand the task at hand and the data that is really relevant, with examples such as the ones below:
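As a rough illustration of the idea (not the paper’s actual benchmark), an anti-interference test case can be built by burying the one relevant sentence among distractors and then checking whether the answer still comes out right. The function names, sentences and the substring-based scoring rule are all hypothetical.

```python
import random


def build_noisy_prompt(relevant, distractors, question, seed=0):
    """Hide one relevant sentence among shuffled distractors, then ask the question."""
    rng = random.Random(seed)
    sentences = distractors + [relevant]  # new list, inputs stay untouched
    rng.shuffle(sentences)
    return " ".join(sentences) + "\nQuestion: " + question


def is_correct(model_answer, expected):
    """Very loose scoring: the expected answer must appear in the model's reply."""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    prompt = build_noisy_prompt(
        relevant="Maria put the key in the drawer.",
        distractors=[
            "The weather in Lisbon was sunny.",
            "A train left the station at noon.",
            "Oranges are rich in vitamin C.",
        ],
        question="Where is the key?",
    )
    print(prompt)
    # A model's reply would be scored like this:
    print(is_correct("The key is in the drawer.", "drawer"))
```

The more distractors you add, the more the model has to work out which piece of the input actually matters.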

A breath of fresh air for the industry

Don’t get me wrong, not many of you are going to use FLM-101B in your life, because it was not created for that purpose.

These researchers had another idea in mind, which is to elevate this industry to new heights by proposing a new training method that is, finally, viable, and bring into the limelight several intuitions about LLM intelligence evaluation that will surely be taken into account in future works.

Finally, the fact that this paper comes from China is really positive, showing that the country is finally opening up and sharing AI knowledge with the world.

Because this industry should build bridges, not walls.

🫡 Key contributions 🫡

  • The first-ever proof that we can build huge models with very low investments

  • The first successful implementation of xPos positional embeddings, an innovation that could reinforce model extrapolation, a key step for AGI

  • A new methodology to evaluate true intelligence in machines, moving from a knowledge-based method to a generalization-based one (if you couldn’t have known something by heart, you’re forced to reason it).

🔮 Practical implications 🔮

  • Instead of building billion-dollar-in-cost models that know everything there is to know (no user or company actually needs that), we should build smaller models with proven reasoning capabilities (high IQ) to then tailor to specific domains. Models with less general knowledge, but higher IQ and domain specificity

  • This could be a deviation from the path led by ChatGPT, a step in the direction of domain-specific models, which could seriously affect future valuations of big players

👾 Best news of the week 👾

🧐 Adept releases Persimmon, the best under-10-billion-parameter open-source model

🤗 HuggingFace releases LLM training service for companies

🧑🏽‍🏫 OpenAI releases ChatGPT guide for teachers

😍 While the best image generation model, MidJourney, sucks at typography, IdeoGram is amazing. Try it!

🥇 Leaders 🥇

This week’s issue: My ‘holy grails’ of learning, the ultimate bible with all you need to stay ahead of the curve in AI.

As you know, I am now fully committed to my Leaders subscription, part of this newsletter.

If you’re a regular reader of my Medium content, you will have noticed that my output on that platform has decreased in favor of more content in my newsletter.

Posting less content on Medium means I’m losing a lot of revenue, but it’s a qualitative investment for me.

I’m not leaving Medium anytime soon, don’t worry, but I want to be less focused on going super viral, and more focused on delivering value.

Therefore, in today’s issue, I’m going to distill for you all the main sources of great information I’ve gathered over time.

Those include:

  • The best courses in the world (ranked by complexity level) to speed up your training

  • Hidden gems that few know of with the key timeless concepts that remain no matter how much AI evolves

  • Key pieces to make you reflect on the critical times we’re living in by some of the most influential people in the world

  • How to apply AI

  • And for those wanting to invest in AI, the point of view of those pouring the money in, as the finance folks have brilliant insights that will shape how you invest or build the companies of the future around AI.

Let’s go!

Disclaimer: I’m not affiliated with any of the people mentioned or the companies shared. This is purely based on my own experience.


Andrej Karpathy, the Slovak-born genius

We’re starting strong, guys, with what may be one of the brightest minds of our time, Andrej Karpathy.

An Eastern European prodigy and founding member of OpenAI back in 2015, he was poached by Elon Musk to lead Tesla’s autonomous driving technology for many years. Today, after having also taught deep learning at Stanford, he has returned to OpenAI.

Andrej is also a rare sight, because besides being a literal genius, he’s also great at teaching.

And when I say great, I MEAN GREAT.

This is weird because geniuses tend to be the worst teachers in the world, as their superior brains fail miserably to ground into the common folk.

Below you’ll find a distilled list of the best of the best in his repertoire:


  • The State of GPT: A comprehensive, 40-minute keynote by Andrej at the Microsoft Build 2023 conference. Key concepts:

    • You’ll learn how Conversational AI is built

    • Tips and tricks to better use these models

    • Key intuitions regarding the actual understanding of what ‘ChatGPT really is’

  • Intro to Transformers: An hour-long video of his class at Stanford regarding Transformers.

  • Zero to Hero: His ultra-deep, 10+ hour, zero-to-a-hundred crash course on deep learning. It includes all the intuitions needed to understand how LLMs are really built. It doesn’t get any better than this.

Andrew Ng, the Godfather of Deep Learning

Moving on to another group of amazing courses, we could not forget about another of the larger-than-life Deep Learning godfathers, Andrew Ng.

Besides his undeniable contributions to fostering Deep Learning adoption among big tech companies, Andrew Ng’s commitment to educating people about AI is commendable and incredibly powerful for you.
