AI Still Can't Plan, OpenAI's Drama, Llama 3.2, & more
THEWHITEBOX
TLDR;
🧐 OpenAI is a Never-ending Drama
🦙 Meta Presents Llama 3.2 Models
📰 Google Releases New Gemini 1.5 Pro Version
🤩 GenAI, Faster Adopted Than The Internet
🫢 Allen AI Presents Molmo, the Best Open-Source Model Ever
👑 The Agentic SaaS Wars Are Here
🥳 AI Identifies Unexpected Deaths Before they Take Place
[TREND OF THE WEEK] AIs Still Can’t Plan.
NEWSREEL
OpenAI Just Loves Drama
Another week goes by, and once again OpenAI is surrounded by controversy. Today, our previous news that OpenAI might be going full for-profit became a reality, according to Bloomberg, with Sam Altman allegedly getting 7% of a company that, let’s not forget, was born as a non-profit.
Coincidentally, yesterday the company’s CTO, Mira Murati, announced her departure. In the meantime, they are finally releasing the highly awaited Advanced Voice Mode, but it’s only available in specific countries that, unsurprisingly, don’t include the EU.
TheWhiteBox’s take:
Eleven people founded OpenAI, but only three remain: Sam Altman, Greg Brockman (on sabbatical leave, though), and Wojciech Zaremba. Many other executives are leaving, too, probably unhappy with the company's direction.
It must be said that this was probably inevitable; AI requires humongous amounts of cash, and things will only get worse. Humans are incentive machines, so nobody will pour trillions into a non-profit.
But probably the worst thing here is how they are handling this transition. Four months ago, Sam Altman said he didn’t need the money, which is why he didn’t have a stake in the company. How the turn tables.
OPEN-WEIGHTS
Meta Presents Llama 3.2 Models
Meta has taken a turn toward multimodality and edge devices with Llama 3.2, a new family of LLMs/MLLMs meant to be run more affordably.
Two small models (1 & 3 billion parameters, respectively) are ‘ideal’ for edge devices and optimized for ARM processors (the overwhelmingly dominant processor type in smartphones and other devices). These models offer performance comparable to other models of their size. They are text-only.
Two mid-sized models (11 & 90 billion parameters, respectively), with the latter offering performance comparable to GPT-4o mini. These are multimodal, allowing them to process images in addition to text.
They have also released the Llama Stack API, supported by many providers already, making the overall experience of working with Llama models a great one. With Llama models becoming a ‘developer platform’ for Generative AI, Llama Stack could soon be a ‘must-have’ skill for developers.
TheWhiteBox’s take:
A couple of points to go beyond the hype.
There seems to be a convergence toward 70-90-billion-parameter models. Marketing aside, the edge models aren’t that small, at least not compared to the hardware they are allegedly targeting.
The 3-billion-parameter model, trained at BF16 (as confirmed by the blog post), requires 6 GB of RAM just to store the weights. An iPhone 16 has just 8 GB, leaving only 2 GB for everything else running on the device, which is not feasible. Google’s Pixel family reaches 16 GB, so there’s a greater chance it fits. Still, edge LLMs should aim for 2 GB of RAM at most, which would be the case for Llama 3.2 1B, but that one is text-only, so there’s still room to make them smaller.
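If you want to sanity-check those numbers, here’s a back-of-the-envelope sketch. The parameter counts and BF16 precision come from the discussion above; the 4-bit line is just an assumption to show where quantization could take things:

```python
# Back-of-the-envelope memory estimate for storing model weights.
# Counts only the weights (no KV cache, activations, or runtime overhead).

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return num_params * bytes_per_param / 1e9

# BF16 uses 2 bytes per parameter; 4-bit quantization would use ~0.5.
print(weight_memory_gb(3e9, 2))    # Llama 3.2 3B at BF16 -> ~6.0 GB
print(weight_memory_gb(1e9, 2))    # Llama 3.2 1B at BF16 -> ~2.0 GB
print(weight_memory_gb(3e9, 0.5))  # 3B at a hypothetical 4-bit -> ~1.5 GB
```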
The most interesting thing about these edge models is the training method. Starting from Llama 3.1 8B, Meta first pruned the model (eliminating unnecessary weights) and then performed distillation training, where the larger model acts as the teacher and the smaller model acts as the student. By learning to imitate the teacher’s responses, the student reaches ‘similar’ performance while being considerably smaller.
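To give an intuition of what that distillation step looks like, here’s a minimal PyTorch sketch. This is a generic logit-distillation loss, not Meta’s exact recipe; the temperature and training-loop details are assumptions:

```python
# Minimal sketch of logit distillation (illustrative, not Meta's exact recipe):
# the pruned student learns to match the teacher's token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t ** 2

# Inside the training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(batch).logits
# loss = distillation_loss(student(batch).logits, teacher_logits)
# loss.backward()
```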
FRONTIER
Google Updates Gemini 1.5 Pro
Google has updated its Gemini AI models, introducing Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002. Key changes include over 50% price reductions, increased rate limits, and performance boosts in tasks like math, code, and vision.
Developers benefit from faster outputs, reduced latency, and improved model response quality.
TheWhiteBox’s take:
As Google struggles to clinch any meaningful market share (only 2.1%), despite having a model that is basically on par with OpenAI’s GPT-4o models, it is committed to overturning this trend. The new model is a considerable improvement from May’s version, and it has also cut prices considerably, becoming the cheapest option across all frontier models.
A word of caution: this model competes with GPT-4o-class models; o1 models are an entirely different thing, leading some to argue that they are no longer LLMs but LRMs (Large Reasoner Models).
However, Gemini 2.0 will be an LRM.
DEMAND
GenAI Adoption is The Fastest Ever
According to a new study, and despite perceptions of slow uptake, Generative AI is being adopted faster than PCs and the Internet were.
TheWhiteBox’s take:
While adoption rates are valuable, they don’t account for the cost of adoption. Generative AI is the most expensive technology ever created, and returns on investment are still dozens of multiples away from justifying the spend.
As we’ll cover in this week’s ‘Trend of the Week’ below, the technology’s efficiency needs urgent attention, as costs are already showing very scary signs of spiraling out of control.
OPEN-SOURCE
Molmo, A Frontier Multimodal LLM
The Allen Institute for AI has presented Molmo, a truly open-source model (both the weights and the dataset will be published) that achieves state-of-the-art performance across several benchmarks.
Besides, it includes a fascinating feature that allows the model to pinpoint the exact object you ask about, adding an extra layer of granularity to its already very powerful vision features.
TheWhiteBox’s take:
A common misconception regarding Llama models is that they are open-source. They are not; they are open-weights. Unlike the Allen Institute, Meta doesn’t publish Llama’s dataset.
As AI models simply learn data distributions, not knowing what that distribution actually looks like is suboptimal: the moment you want to fine-tune the model further, not knowing what the model originally learned prevents you from assembling a fine-tuning dataset that doesn’t cause catastrophic forgetting.
The great advantage of open-source over private models is that the entire AI community sits behind them. While they don’t have the same access to capital, the combined human talent eventually leads to better outcomes. Most major technologies in history have eventually been opened up, so let’s hope AI continues down the same path.
AGENTS
The Agentic Wars Are Here
As you may have guessed by now, agents are the next big thing in the Generative AI industry. That is, providing our models with the capacity to take action.
While the results are still premature, and we’ll address them in future Leader segments (I’m preparing a rundown on the impact of AI on the SaaS industry, potentially killing most companies), many of these companies aren’t waiting for AI to kill them and are trying to ‘embody’ this transition themselves.
As reported by The Information, companies like Microsoft, Salesforce, ServiceNow, or Workday are planning fully-fledged agent releases by next year.
This Sunday, we are covering the soon-to-be-hottest jobs in an AI world, some of which are in agents' direct trajectory. Stay tuned.
HEALTHCARE
AI Identifies Unexpected Deaths
New research finds that a machine-learning model, ChartWatch, reduces unexpected hospital deaths by 26%. And, huge news: it’s not ChatGPT!
Jokes aside, while everyone is looking toward Silicon Valley, people with few resources and access only to years-old technology are saving lives. The method is a logistic regression that matches a set of patient parameters to a prediction of ‘patient deterioration,’ allowing nurses and doctors to take preemptive action before it’s too late.
From first principles, that’s not so different from any other AI in the world (yes, that includes the fancy ones). They are all pattern matchers that identify key signals in data before humans do. That led to outcomes like the case of one patient who was proactively diagnosed with potentially deadly cellulitis because the AI flagged that their white cell count was ‘too high’ before the deadly symptoms appeared.
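For intuition, here’s what that kind of deterioration model looks like at its core: a logistic regression over routine patient measurements. Everything below (feature names, numbers, alert threshold) is hypothetical and purely illustrative, not ChartWatch’s actual setup:

```python
# Illustrative sketch only: a ChartWatch-style deterioration score is, at heart,
# a logistic regression over routine patient measurements. Features and data
# here are hypothetical, not the actual ChartWatch inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [heart_rate, white_cell_count, temperature_c, systolic_bp]
X_train = np.array([
    [72,   6.0, 36.8, 120],
    [95,  14.5, 38.9,  95],
    [68,   5.5, 36.6, 130],
    [110, 17.0, 39.4,  88],
])
y_train = np.array([0, 1, 0, 1])  # 1 = patient deteriorated within 48 hours

model = LogisticRegression().fit(X_train, y_train)

new_patient = np.array([[102, 16.2, 39.0, 90]])
risk = model.predict_proba(new_patient)[0, 1]
if risk > 0.7:  # alert threshold chosen purely for illustration
    print(f"Alert care team: deterioration risk {risk:.0%}")
```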
It turns out you don’t need giga-scale data centers to have AI save lives.
TREND OF THE WEEK
AI Still Can’t Plan
In an industry as hyped as Generative AI, it’s more important to know the things that don’t work than those that do. However, knowing what’s true and provable and what’s empty hype is not something that standard media or incumbents will let you see.
Therefore, pieces like today’s go largely unnoticed, as these labs and corporations need society to believe AI is as ‘smart as a PhD,’ while, as you’ll see today, that’s simply a despicable lie.
And after the entire industry went full hype-mode with the release of OpenAI’s o1 models, which allegedly excel at reasoning, the first research-based results are starting to come out and, well, it’s a mixed bag.
Regarding one of the key capabilities AI must conquer, planning, the verdict is clear: AI still can’t plan. Along the way, these models prove to be primitive, expensive, and, most concerningly, brilliant deceivers, leading us to the trillion-dollar question:
Are foundation models actually worth it?
It’s Progress, but It’s Still Bad
Planning is an essential capability that has obsessed AI researchers worldwide for decades. It allows humans to tackle complex problems, consider the global picture, and strategize the best solution; it’s one of the key traits that separate us from other animals.
In particular, planning is essential when facing open-ended questions where the solution isn’t straightforward, and some search must be done to identify the best solution. Planning not only elucidates feasible options but also breaks down the problem into simpler tasks.
Most complex world problems fall into this definition, so almost nothing would be done without planning.
Sadly, LLMs are terrible at planning.
One of the Hardest AI Problems
When we take our state-of-the-art LLMs and test them on planning benchmarks, with tasks like rearranging stacks of blocks that are simple for humans to solve, the results are discouraging: on the ‘Mystery Blocksworld’ benchmark, ChatGPT, Claude, Gemini, and Llama all score a net 0% of correct answers.
This isn’t that surprising, considering all these are System 1 thinkers. This means they can’t iterate over the question to find the best possible solution, and instead, we simply expect them to get the correct plan in one go, something that even humans would struggle with.
But as we discussed two weeks ago, o1 models represent this precise paradigm shift, allowing them to explore possible solutions and converge on the right answer.
However, the results are underwhelming, though not for the reasons you might think.
Overconfident and Dumb
At first, things look great. When evaluated on the same dataset, o1 models considerably improve the outcomes, saturating the first benchmark (97.8% accuracy) and increasing the results from GPT-4o’s net 0% to 57% with o1-preview.
It seems like a really promising improvement, right?
Yes, but if you look carefully, things get ugly. For starters, as the number of steps required to solve the plan increases, accuracy quickly crashes, falling back to 0% when the plan requires 14 steps or more.
We also need to consider the costs. As you know, these models are sequence-to-sequence models (they take in a sequence and give you one back). Thus, the cost of running them is metered by the number of input and output tokens, aka how many tokens they process and how many they generate.
Usually, generating tokens costs around 5x more than processing them, meaning the more tokens you generate, the more you pay. Sadly, this goes entirely against the nature of o1 models.
As they reason over their own generations, they produce more tokens by design, sometimes reaching 100 to 1,000 times the average output size of GPT-4o. This makes o1-preview 100 times more expensive than the previous generation, and that’s only because OpenAI is heavily subsidizing the costs on our behalf (aka the model should be much more expensive).
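To see how quickly that arithmetic gets out of hand, here’s a tiny sketch of the pricing model. The per-million-token prices and token counts are placeholders chosen to illustrate the mechanics, not OpenAI’s actual rates:

```python
# Rough cost model for sequence-to-sequence pricing: you pay per input token
# and (more) per output token. Prices and token counts are placeholders to
# illustrate the arithmetic, not OpenAI's actual rates.

def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request given per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Same prompt, but a reasoning model emits ~100x more (hidden + visible) tokens,
# at a higher per-token rate on top of that.
baseline = request_cost(1_000, 500, in_price_per_m=5, out_price_per_m=25)
reasoner = request_cost(1_000, 50_000, in_price_per_m=15, out_price_per_m=75)

print(f"baseline-style request: ${baseline:.4f}")
print(f"reasoner-style request: ${reasoner:.4f} ({reasoner / baseline:.0f}x more)")
```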
Concerningly, the model also overthinks. According to the reported results, to solve a 20-step plan (which in our case means a plan that requires rearranging blocks 20 times), the model has to generate around 6,000 tokens, or roughly 4,500 words (1,500 more than this entire newsletter), to find the answer.
And that’s the lower bound, considering this is one-shot (the model can explore, but it’s given only one try to get the answer right).
Long story short, with o1 models, costs are going to explode, especially as OpenAI researchers want them to eventually think for entire weeks before answering. Seeing the incremental improvements over the previous generation, you may think it’s worth it.
But… is it?
Are o1 models Worth it?
This research perfectly clarifies the biggest issue with current AI models: the depth vs. breadth problem. No matter how broad we go, deep, task-specific models still put LLMs to utter shame.
Cost Effectiveness
While o1 models are claimed to be the ‘smartest AI models alive,’ another AI planning mechanism, Fast Downward, a planner that has been around since 2011, manages to get 100%, humiliating the proclaimed ‘state-of-the-art’ models.
Importantly, Fast Downward is fast and very cheap to run. It is several orders of magnitude more cost-effective than o1 models, with at least three times better performance.
Fast Downward works similarly to o1; it uses a search heuristic and iterates until it finds the best answer. The key difference is that, unlike o1 models, which are hundreds of gigabytes (or even terabytes) in size, it’s small (no LLM involved) and task-specific, meaning it can perform this search at a scale that o1 simply can’t.
To make matters worse, researchers also tested models in ‘unsolvable plans,’ which led to concerning results.
One Hell of a Gaslighter
When evaluating whether AIs can acknowledge that a plan is unsolvable, instead of trying to solve it no matter what, o1 turns out to be… mostly incapable of doing so.
In fact, o1-preview identifies a problem as unsolvable only 16% of the time. This leads to a plethora of examples where the model reaches completely unfeasible and frankly stupid conclusions while being extremely eloquent about them, making it look more like a professional gaslighter.
In this newsletter, we have already covered in detail why LLMs are professional bullshitters.
This leads me to an unequivocal realization:
Maybe LLMs aren’t dangerous to humanity because they are too powerful, as {insert Big Tech CEO} will tell you, but because they are ‘convincingly flawed’; they deceive us into using them in complex situations they can’t manage, leading to disastrous consequences.
In a nutshell, a simple combination of GPT-4o with the ability to call Fast Downward would lead to performance that is orders of magnitude better than o1 models without all the extra gaslighting and outsized costs.
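To make that combination concrete, here’s a rough sketch of what ‘GPT-4o calls Fast Downward’ could look like. This is not the researchers’ implementation; the prompt, file paths, and search options are assumptions, and it presumes a local Fast Downward checkout plus the openai Python client:

```python
# Illustrative sketch of "LLM as translator, classical planner as solver".
# Assumes a local Fast Downward checkout (fast-downward.py) and the openai
# Python client; prompt, paths, and search options are illustrative.
import subprocess
from openai import OpenAI

client = OpenAI()

def plan_with_fast_downward(task_description: str, domain_pddl_path: str) -> str:
    # 1. Ask the LLM only to formalize the task as a PDDL problem file.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Translate this planning task into a PDDL problem file "
                       "matching the given domain. Return only PDDL.\n\n"
                       + task_description,
        }],
    )
    with open("problem.pddl", "w") as f:
        f.write(response.choices[0].message.content)

    # 2. Let Fast Downward do the actual search; it is exhaustive and verifiable.
    subprocess.run(
        ["./fast-downward.py", domain_pddl_path, "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    # Fast Downward writes the found plan to a file named 'sas_plan' by default.
    with open("sas_plan") as f:
        return f.read()
```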
That makes you wonder, is this new paradigm worth it?
A New Form of Evaluating
As we commented on Sunday, incumbents are already factoring in Giga-scale data centers and models that cost hundreds of billions to train and hundreds of millions to run, probably considering o1-type models as ‘the new normal.’
But as we’ve seen today, their promises are largely unfulfilled, and even if they are showing progress, they still lag behind 10-year-old AI algorithms that run at 100,000 times lower cost with better performance.
Now, if you ask Sam Altman, he will tell you that larger scale (larger models and larger training and inference budgets) will not only render narrower solutions like Fast Downward obsolete but also, in the process, ‘discover all of physics.’
But let me be clear on this: that belief is based on absolutely nothing but the zealous conviction that larger compute budgets lead to better outcomes; no one, and I mean no one, can prove it to you.
Thus, whether you believe them or not largely depends on whether you trust people who are economically incentivized to say precisely those words.
We Need To Meter Costs
We need to start looking at costs.
It’s all fun and games when you only care about squeezing out 1% more performance, even if that 1% costs millions of dollars more. For that reason, it’s crucial that we start measuring ‘intelligence per watt’; talking to AIs is fine, but it’s pointless if you need the entire yearly energy consumption of an average US home to make a plan a human could make on their own.
But you probably won’t see that happening anytime soon because then the picture becomes horrific, and every single dollar invested in the industry is going into this precise yet uncertain vision.
In the meantime, Silicon Valley will simply gaslight you, preaching, ‘Worry not, mortal, bigger will yield better.’ And while that has held true until now, when are we going to question whether it’s worth it?
Yet, a trillion dollars later, LLMs still can’t consistently do math, make simple plans, or acknowledge their own mistakes.
What makes us think an extra trillion dollars will do the trick? You know the answer by now: AI is a religion, not a science.
And I’m not hating on religions; humans need faith to make sense of their world, and that’s totally fine. But I don’t think it’s OK to sell blind faith as certainty, serving faith-based arguments as ‘truth’ to gaslight a society that knows no better.
Religion is regarded by the common people as true and by the rulers as useful.
LLMs might be the answer, but we don’t know for a fact. Remember that.
THEWHITEBOX
Premium
If you like this content, consider joining Premium: you will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.
This week’s new content includes:
70B Models Are the New Black: A look at why all labs are converging on the same model size.
An Arrogance that Knows No Limits: Silicon Valley Thinks It Knows Best. But Does It?
Until next time!
Give a Rating to Today’s Newsletter
For business inquiries, reach out to me at [email protected]