
OpenAI Changes the Game & The Arrival of the King: LLaMa 3.1

THEWHITEBOX
TLDR;

  • 🗞️ News from Numina, OpenAI, Mistral, & SuperMicro.

  • 📚 New short courses, instant LLMs, & more.

  • 💰 OpenAI Changes the Economics

  • 🦙 Meta Releases the New King, Llama 3.1

THEWHITEBOX
Weekly News 🚨🚨

Welcome back!

To start the newsreel strong: it’s not every day that the greatest mathematician alive, Terence Tao, is ‘surprised’ by the results of an AI. But that is exactly what the Numina + HuggingFace team achieved in the AIMO competition, a contest of Olympiad-level maths problems.

The main method used was fine-tuning the model with reasoning-augmented data. In other words, just like we covered in our piece ‘Is AI really intelligent?’, one of the most promising avenues of reasoning research is creating synthetic data that helps models learn better reasoning thought processes.

Coincidentally, this method has also been used in training the new supermodel LLaMa 3.1 (more on that below).

Moving to hardware, Supermicro, an AI hardware company that produces racks, liquid cooling, and other essential components that GPUs need to work, claims that its new cooling system ensures up to 40% less power consumption for an equal budget.

Meanwhile, the company has grown 200% year-over-year and is expected to grow an additional 152%.

You would be surprised how important these systems are and how much energy they require. For reference, an NVIDIA H100 GPU has a Thermal Design Power (TDP) of 700W (the power it needs to work).

Yet, the total per-GPU consumption of H100 clusters is on average double that value, as all the additional components (like networking or cooling) add almost another 700W to your bill.

With the new Blackwell GPUs intended primarily for liquid cooling (you could even say requiring it, as Blackwells will reportedly push per-rack power demand to 40 kW), Supermicro is having a lovely time these days.

Finally, if Meta’s blowout release below wasn’t enough, Mistral announced the release of Mistral Large 2 yesterday.

At just 123 billion parameters, Mistral claims it’s on par with Llama 3.1 405B, GPT-4o, and Claude 3.5 Sonnet on math and code while being much more efficient (or pareto-optimal, the corny way researchers describe this).

This release must not be ignored, as the model efficiently runs on a single H100 node (8 GPUs) despite having frontier-level capabilities.

This means all of the model’s GPUs communicate over NVIDIA’s 900 GB/s NVLink interconnect, which is crucial to reduce latency (node-to-node communication goes over NVIDIA InfiniBand instead, dropping bandwidth to 25 GB/s, roughly 36 times slower).

In other words, staying within a single 8-GPU node is essential. While Llama 3.1 405B would at first seem to need two nodes (16 GPUs) at FP16 precision (roughly 810 GB of weights), Meta also released a quantized FP8 version that occupies roughly 405 GB, which means Llama 3.1 405B can also be run on a single node.
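As a quick sanity check, here is a back-of-the-envelope sketch of those weight footprints (weights only, ignoring the KV cache and activations; the 80 GB figure is the HBM of a standard H100):

```python
PARAMS = 405e9                 # Llama 3.1 405B parameter count
BYTES = {"FP16": 2, "FP8": 1}  # bytes per parameter at each precision
H100_MEMORY_GB = 80            # HBM per H100 GPU

for precision, bytes_per_param in BYTES.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    gpus = weights_gb / H100_MEMORY_GB
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> needs >{gpus:.1f} H100s for weights alone")
# FP16: ~810 GB (more than one 8-GPU node); FP8: ~405 GB (fits a single node, leaving room for the KV cache)
```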

The era of efficient frontier models has arrived.

LEARN
The Insights Corner

🧐 The Creator of Mamba and how it could change the AI paradigm

😍 Want to learn how to pre-train an LLM? You can do so with Deeplearning.ai’s newest short course.

🤩 Witness Llama 3.1 8B Instant, the fastest model ever.

🦙 Open-Source is the path forward, by Mark Zuckerberg

🦙 Thomas Scialom, model lead for Llama 3.1, speaks about the experience and the process on The Latent Space Podcast

Sponsored
simple.ai - The Agent AI newsletter. Join 100,000+ others and learn how to use Agent AI to grow your career or business.

ECONOMICS
OpenAI Changes the Economics Game

Incredibly, Generative AI might have just become absurdly cheap. OpenAI has released a new model, GPT-4o mini, with unit economics that are, quite frankly, irresistible.

But is this enough to finally incentivize companies to use Large Language Models?

What I can tell you is that you are going to be blown away.

Impressive is an Understatement

When one considers the speed at which GPT-4o mini works and its power, it’s almost hard to believe.

According to preliminary results, the model is already as good as GPT-4-Turbo, sitting at 2nd place overall, on par with GPT-4o, Claude 3.5 Sonnet, and Gemini, the holy trinity.

But the most impressive thing becomes clear when you compare its performance relative to its size; this is the most powerful model ever created on a pound-per-pound basis.

For reference, Claude 3 Opus, rumored to be bigger than 2 trillion parameters in size, could turn out to be worse than GPT-4o mini, signaling how fast the pace of innovation in the industry continues to be.

While the intelligence barrier has remained largely unchanged for at least two years (probably because the average training FLOPs have not increased since GPT-4), the performance relative to size, also known as model compression, has dramatically improved in the last few months.

But how cheap is GPT-4o mini at scale? Here’s a quick way to evaluate your future projects using API-based models.

A New State of Economics

Before we delve into the economics, here's a quick review of how LLM APIs are charged. LLMs are sequence-to-sequence models, meaning that you input a sequence, usually text, and they output a sequence, usually a response or continuation to the input.

Barring rare exceptions where providers charge the same price for both (usually subsidizing output tokens to attract more customers), output tokens are more expensive than input tokens because processing is cheaper than generation.

In the processing phase, also known as the prefill phase, the model ingests the entire sequence and generates the KV Cache, which it uses to make the next generations more efficient.

This means the model does much more work to produce the first output word, but the heavy lifting is done only once, thanks to the cache. From then on, successive tokens are faster and cheaper to generate.

However, as the generation process may be executed hundreds of times per user request, generation becomes more expensive in the long run, especially considering that users have much more control over how long or short they want the output to be.
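To make the prefill/decode distinction concrete, here is a toy NumPy sketch (made-up dimensions, a single attention head, no real model) of how the KV cache lets every generated token reuse the work done on the prompt:

```python
import numpy as np

d = 16                                    # toy hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.normal(size=(3, d, d))   # toy query/key/value projection matrices

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = (K @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Prefill: process the whole prompt once and build the KV cache (billed as input tokens)
prompt = rng.normal(size=(100, d))        # embeddings of 100 "prompt tokens"
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each new token only computes its own projections and reuses the cache (billed as output tokens)
x = rng.normal(size=(d,))                 # embedding of the latest token
for _ in range(5):
    out = attend(x @ Wq, K_cache, V_cache)              # cheap: one query against cached keys
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])   # append this token's key
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])   # append this token's value
    x = out                                             # stand-in for the next token's embedding
```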

But what are these prices for GPT-4o mini?

GPT-4o mini costs $0.15 per 1 million input tokens and $0.60 per 1 million output tokens. For reference:

  • That is 33 and 25 times cheaper than the state-of-the-art, GPT-4o,

  • 20 and 25 times cheaper than Claude 3.5 Sonnet, the state-of-the-art for coding, and

  • 66 and 50 times cheaper than GPT-4-Turbo despite being a better model

Not bad.

Most enterprise LLM platforms also offer serverless endpoints for open-source models that are likewise priced per token, so this pricing scheme is not unique to proprietary models.

But can we get actual business case numbers? Well, yes!

Let’s say a company wants to give 100 users access to GPT-4o mini. Then, assuming:

  • An average of 10 daily requests with 20 working days per month,

  • An average input sequence size of 3000 words (let’s assume the users share PDFs with the model) or 3,500 tokens,

  • And an average response size of 1000 words (assuming proper prompt engineering to ensure the model isn’t too verbose) or 1,300 tokens,

This gives an annual cost of the service of:

  • 100 users x 10 × 20 = 20k requests a month

  • Total amount of input tokens per month = 20k * 3.5k tokens = 70 million

  • Total output tokens per month = 20k * 1.3k tokens = 26 million

  • Total monthly costs = (70 × $0.15) + (26 × $0.6) = $10.5 + $15.6 = $26.1

  • Total annual costs = $26.1 × 12 = $313.2
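If you want to plug in your own assumptions, here is the same arithmetic as a minimal sketch (the prices and usage figures are simply the ones assumed above):

```python
# Assumed GPT-4o mini prices (per 1M tokens) and the usage figures from the example above
PRICE_IN, PRICE_OUT = 0.15, 0.60                  # $ per 1M input / output tokens
USERS, REQS_PER_DAY, WORK_DAYS = 100, 10, 20
TOKENS_IN, TOKENS_OUT = 3_500, 1_300              # tokens per request (input / output)

requests_per_month = USERS * REQS_PER_DAY * WORK_DAYS                     # 20,000
monthly_cost = (requests_per_month * TOKENS_IN / 1e6) * PRICE_IN \
             + (requests_per_month * TOKENS_OUT / 1e6) * PRICE_OUT
print(f"Monthly: ${monthly_cost:.1f} | Annual: ${monthly_cost * 12:.1f}")
# → Monthly: $26.1 | Annual: $313.2
```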

Read that again: 100 users using GPT-4o mini 10 times a day for a year would only cost 313 dollars.

And what about potential savings?

Assuming that every request saves the user 2 minutes of work (we are being very conservative; it’s probably more), with 240k requests a year, that would be 480k minutes, or 8k hours.

In Full-Time Equivalents, assuming an average dedication of roughly 1.7k hours per person per year (11 months of 20 working days at 8 hours/day), that is about 4.5 FTEs.

At an average US salary of $50k, that’s roughly $227,000 in labor value. Finally, assuming $100k of professional services to set up the service, that’s over $100k in net savings in the first year alone, with implementation payback in roughly six months and over $200k in savings in year two.
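And the savings side of the same back-of-the-envelope math, again using only the assumptions stated above:

```python
MINUTES_SAVED_PER_REQUEST = 2                  # deliberately conservative
REQUESTS_PER_YEAR = 20_000 * 12                # 240k requests a year
HOURS_PER_FTE = 11 * 20 * 8                    # 1,760 hours (the "1.7k" above)
AVG_SALARY, SETUP_COST = 50_000, 100_000       # $ per FTE and one-off professional services

hours_saved = REQUESTS_PER_YEAR * MINUTES_SAVED_PER_REQUEST / 60   # 8,000 hours
ftes_saved = hours_saved / HOURS_PER_FTE                           # ~4.5 FTEs
labor_value = ftes_saved * AVG_SALARY                              # ~$227k
print(f"~{ftes_saved:.1f} FTEs, ~${labor_value:,.0f} in labor value, "
      f"~${labor_value - SETUP_COST:,.0f} net in year one")
```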

And they told us we shouldn’t be worried about our jobs?

TheWhiteBox’s take

Technology:

We can only guess, but things like quantization to drop the model’s memory requirements and a massive mixture of experts to reduce latency and forward-pass FLOP requirements are most definitely two techniques they used to create GPT-4o mini.

Products:

The economics are staggering. However, a problem remains. Enterprises don’t need a model that’s good at many things but great at none. They need a model that’s great at one thing, which most LLMs can’t offer today.

The solution?

Fine-tuning, which OpenAI also happened to announce right after Meta’s Llama 3.1 announcement, even offering a free daily tier of 2 million training tokens.

If they really open up their fine-tuning offering and provide enough flexibility, combined with their great economics, OpenAI might finally be positioned to convince enterprises (unless Llama 3.1, below, prevents that).

Markets:

At first, you may be tempted to think that better unit economics should yield fewer capital expenses for big tech companies in future quarters.

But I don’t expect that to be the case, as training smarter models will become more expensive over time (also, there’s a non-zero chance they are heavily subsidizing these costs). OpenAI is reportedly losing $5 billion this year alone.

Nonetheless, this could indeed be a catalyst for enterprise adoption, which should increase revenues over time and justify the inflated valuations we see today.

FRONTIER RESEARCH
LLaMa 3.1, Open-Source’s Savior

In a major turn of events, history has been made.

Meta has finally released its largest model ever, LLaMa 3.1 405B, as well as its smaller brothers 8B and 70B, to change the AI landscape completely:

In a nutshell, open-source has finally caught up with the holy trinity: GPT-4o, Gemini 1.5, and Claude 3.5 Sonnet.

This is great news for all of us but terrible news for the incumbents, as they see the LLM market at high risk of commoditization and, with that, their billion-dollar CAPEX investments at serious risk of never making a return.

But what’s so special about LLaMa 3.1, and why is it so important? There’s too much information floating around, so here are all the key things you need to know about this historic release.

Holy LLaMa

According to self-reported data, it is unapologetically on par with the best models out there: Claude 3.5 Sonnet, GPT-4o, and Gemini, or NVIDIA’s Nemotron, which we covered a few weeks ago:

I mean, the model is so good that its smaller 70B version is almost as good as the holy trinity across most tasks, and we are talking about a model that is close to 30 times smaller than these.

But you don’t have to take their word for it. Unbiased leaderboards like Scale.ai and lmsys are also championing this model as one of the best (and the best in tasks like instruction following):

The model excels at various tasks, such as math, reasoning, tool-calling (using external tools in its responses), or common sense, matching the best models in every category.

Considering the sheer scale of the information Meta has shared, I will cut to the chase and answer the key questions that might be surfacing in your mind right now, in a question-and-answer structure to make it easier to digest:

What was the biggest change regarding the previous version?

Data. Algorithmically speaking, the model is almost identical to previous versions and extremely close to the original Transformer paper, signaling how little things have changed ever since.

Surprisingly, they steered away from mixture-of-experts architectures, where the model is broken into experts to achieve faster inference and which had become very common (e.g., GPT-4 or Mixtral 8×22B), citing training stability concerns.

Instead, they released a quantized version for faster inference (more on that below).

However, they have made huge changes to the data pipeline, increasing the amount of synthetic (artificially generated) data to train the model, specifically as a post-training method, to enhance reasoning, coding, or alignment (instruction following tasks and model awareness of when not to respond).

Out of this synthetic data, we need to highlight one of the things I’ve been talking about for quite some time: enhancing the model's reasoning capabilities through data augmentation.

As explained in previous issues of this newsletter, such as OpenAI’s Project Strawberry or our study on AI intelligence, data augmentation emerges as a key new method to enhance model reasoning.

As the links above explain (click on the latter for a deep dive into reasoning methods being studied), a hot new way to enhance reasoning is to create specific datasets that cover multi-step reasoning chains (thought processes to solve complex problems) that the model learns to imitate.

In particular, Meta points to OpenAI’s ‘Let’s Verify Step by Step’ as the main method to generate the reasoning-enhanced chains, and to a combination of Monte Carlo Tree Search with an LLM for more complex reasoning processes to imitate.

Models learn multi-step solution chains to improve reasoning. Source: OpenAI

As for the latter method, it uses an LLM to search the space of possible solutions until it finds the best one. It’s basically Chain-of-Preference Optimization, which we also covered in the piece on AI intelligence in case you want more detail.
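Purely as an illustration of the idea (not Meta’s actual pipeline, which reportedly combines step-level verification and tree search), here is a toy rejection-sampling sketch: sample several candidate chains per problem, keep only those that land on the correct answer, and fine-tune on the survivors. The `sample_solution` stub stands in for a real LLM call.

```python
import random

# Hypothetical stand-in for a real LLM call; a real pipeline would sample a
# step-by-step solution from a model and verify each step, not just the answer.
def sample_solution(problem: str) -> tuple[list[str], str]:
    steps = [f"Step {i}: ..." for i in range(1, 4)]
    return steps, random.choice(["42", "41"])          # (reasoning chain, final answer)

def build_reasoning_dataset(problems: dict[str, str], samples_per_problem: int = 8) -> list[dict]:
    """Keep only chains whose final answer matches the gold answer; the model is
    then fine-tuned to imitate these correct multi-step thought processes."""
    dataset = []
    for problem, gold in problems.items():
        for _ in range(samples_per_problem):
            chain, answer = sample_solution(problem)
            if answer == gold:                          # the 'verifier' in this toy version
                dataset.append({"prompt": problem, "completion": "\n".join(chain)})
    return dataset

print(len(build_reasoning_dataset({"What is 6 × 7?": "42"})))
```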

How expensive was it to train the model?

Meta did not disclose the actual numbers, but we can calculate a rough estimate to give you an idea.

Knowing the model trained for 54 days on a 16k-GPU NVIDIA H100 cluster, and using an average industrial electricity tariff of $0.083/kWh (assuming it was trained in the US), we have:

  • Cluster: 16k H100 / 8 = 2000 servers (8 GPUs and 2 CPUs per server)

  • Average consumption per node: 11.1 kW, or roughly 1.4 kW per GPU, according to SemiAnalysis.

  • Cluster power = 2,000 nodes × 11.1 kW per node = 22,200 kW, meaning this cluster requires 22.2 MW of power.

  • Training time = 54 days → 22.2 MW × 24 hours × 54 days ≈ 28,771 MWh of consumed energy.

  • At an average tariff of $0.083 per kWh → 28,771,200 kWh × $0.083 ≈ $2.4 million.
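The same estimate as a small sketch, using only the figures above (all of which are estimates, not numbers Meta disclosed):

```python
GPUS, GPUS_PER_NODE = 16_000, 8
NODE_POWER_KW = 11.1             # per-node consumption (SemiAnalysis estimate)
TRAINING_DAYS = 54
TARIFF_PER_KWH = 0.083           # assumed average US industrial tariff, $/kWh

nodes = GPUS // GPUS_PER_NODE                    # 2,000 servers
cluster_mw = nodes * NODE_POWER_KW / 1_000       # ~22.2 MW
energy_mwh = cluster_mw * 24 * TRAINING_DAYS     # ~28,771 MWh
cost = energy_mwh * 1_000 * TARIFF_PER_KWH       # kWh × $/kWh
print(f"~{cluster_mw:.1f} MW cluster, ~{energy_mwh:,.0f} MWh, ~${cost / 1e6:.1f}M in electricity")
# → ~22.2 MW cluster, ~28,771 MWh, ~$2.4M in electricity
```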

It may not seem like a lot, but running costs are a minor part of the TCO (total cost of ownership). The capital costs of this cluster are huge: at $35k per GPU, this is an investment of $560 million, without counting all the other equipment (racks and such), which would elevate that number quite a bit.

On the inference side, running this 22.2 MW data center for a full year would cost around $16.1 million in electricity, although that would support hundreds or even thousands of instances of the model running simultaneously.

Is the model multimodal?

Well… sort of. In fact, if we look at it from a purist perspective, the answer is no, despite the model having multimodal capabilities.

They trained additional LLaMa 3.1 models (not released yet) that incorporate image, video, and audio/speech capabilities using a compositional method; in other words, they train an LLM first and then connect the multimodal capabilities using adapters.

But what is the difference?

If we compare these upcoming ‘multimodal’ LLaMa 3.1 models with GPT-4o, there’s a huge difference: while GPT-4o can reason across different modalities, aka it really doesn’t care in what form you communicate with it, the LLaMa 3.1 multimodal capabilities are heavily biased to text.

Specifically, using a pre-trained text-based Large Language Model, they connect it to image, video, and audio components through ‘adapters’ that either ‘transform audios into text’ or allow LLMs to ‘pay attention to image and video features.’

In other words, as shown below, you can think of the speech encoder+adapter (right) as taking in speech and transforming it into text, and the image and video encoders (left) as exposers of semantic information that the LLM can use to enhance its context.

Technical specificities:

The speech adapter is a small neural network that transforms audio representations into text embeddings, while the image and video adapter allows the LLM to enrich its context (like when it needs to answer a question about an image/video) through the use of inserted cross-attention layers (these add roughly 100 billion more parameters to the model).

To learn how attention works, read here.
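Purely as an illustration (Meta hasn’t released these multimodal models or their implementation details), here is a minimal PyTorch sketch of the two adapter patterns described above, with made-up dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only
D_TEXT, D_AUDIO, D_IMG = 4096, 1280, 1024

class SpeechAdapter(nn.Module):
    """Maps audio-encoder features into the LLM's text-embedding space,
    so speech effectively arrives at the LLM 'as text'."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(D_AUDIO, D_TEXT), nn.GELU(), nn.Linear(D_TEXT, D_TEXT)
        )

    def forward(self, audio_feats):            # (batch, audio_len, D_AUDIO)
        return self.proj(audio_feats)          # (batch, audio_len, D_TEXT)

class VisionCrossAttention(nn.Module):
    """A cross-attention layer inserted between LLM blocks: text tokens (queries)
    attend to image/video features (keys/values) to enrich their context."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(D_IMG, D_TEXT)
        self.attn = nn.MultiheadAttention(D_TEXT, num_heads=32, batch_first=True)

    def forward(self, text_hidden, image_feats):
        kv = self.img_proj(image_feats)                 # project image features into text space
        attended, _ = self.attn(text_hidden, kv, kv)    # queries = text, keys/values = image
        return text_hidden + attended                   # residual: the text-only path stays intact

# Toy usage with random tensors
text = torch.randn(1, 10, D_TEXT)         # 10 text-token hidden states
img = torch.randn(1, 64, D_IMG)           # 64 image-patch features
out = VisionCrossAttention()(text, img)   # (1, 10, 4096)
```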

Do they introduce new improvements in hallucinations?

They claim to, but it’s not that special. They focus a lot on improving alignment, aka making sure the model knows what it can answer.

Indirectly, if the model ‘knows what it knows and avoids answering what it doesn’t,’ it will produce fewer hallucinations. But as we’ve discussed multiple times, the key to solving hallucinations resides in overfitting to key data through memory adapters.

Are there any other things worth mentioning?

The model displays pretty strong capabilities in coding and multilingual tasks (using various languages). To enhance these, they created ‘experts’ (not to be confused with a mixture of experts).

These experts are checkpoints that are ‘branched off’ from the model and fine-tuned on task-specific data. In layman’s terms, they took versions of the model and, instead of continuing them down the usual training pipeline, fine-tuned them on tasks like coding or multilingual data.

Then, they used these supercharged models to generate higher-quality synthetic data, which they used to fine-tune the global model and enhance its coding/multilingual capabilities without hurting its overall performance.

Overall, it was an absurdly strong release by Meta that might have changed the game, probably forever.

TheWhiteBox’s take

Technology:

Model-wise, it’s surprising to see how little models have changed over the years. Once again, data is king, and using surrogate models to generate new data to train the LLM will become the norm.

Products:

Meta’s AI strategy is one for the books. Instead of closed-sourcing everything like the other big tech companies are doing (with Apple’s notable exception), they intend to use Llama's extreme adoption to create an ecosystem of products (like Llama Guard) around it that they can then learn from and monetize.

In other words, if Llama becomes the industry standard, then with every new generation of the model, Meta sets the standard.

It's still too early to say who is in the right here, but considering the very real commoditization risks, this might be one of the smartest moves in capitalism's history.

Markets:

Ahead of Meta’s earnings call on July 31st, the markets don’t seem enchanted by its strategy and didn’t react positively to the news, despite its importance to the overall AI space.

Despite a hefty AI CAPEX of $7 billion last quarter, there are no discernible AI revenues to be seen, and there’s no clear answer as to how Meta will monetize these investments (it’s more of an ‘I trust you, bro’ on Mark Zuckerberg than a clear layout of how they will monetize).

That said, investors in other ‘bubbly’ AI stocks are surely paying attention to the ‘meta effect,’ which could make it very difficult for other incumbents to monetize their offerings.

THEWHITEBOX
Closing Thoughts 🧐

Thanks to Llama 3.1, this week may be remembered as the week when open-source finally caught up with closed-source models.

With open-source’s big play, the paths to monetization for many of the billion-dollar-capitalized companies look as complicated as ever, to the point that I feel the only way these companies are going to return their investors’ money, let alone a profit, rests on one single word:

Reasoning.

Unless they develop a breakthrough open-source can’t imitate, open-source will always eventually catch up.

Until next time!

What’s coming next?

This week, on our weekly premium Leaders segment, we will be looking into the depths of one of the world’s most powerful corporations, Google, and what its AI strategy and outlook are for the coming years.

For business inquiries, reach out to me at [email protected]