
Fear Chinese Robots & LLMs, LLMs Know More Than They Show, & More

In partnership with

For business inquiries, reach out to me at [email protected]

THEWHITEBOX
TLDR;

  • 🇨🇳 We Need to Talk About Chinese Robots

  • 😍 Alibaba’s Qwen 2.5 Coder is Amazing

  • 🥵 Generative AI is Stalling

  • 🫣 Apple’s New AI Product

  • 🎭 A New Unbeatable Benchmark

  • 🌊 Google’s New Flood Forecasting Model

  • [TREND OF THE WEEK] LLMs Know More Than They Show


Learn AI in 5 minutes a day.

The Rundown is the world’s most trusted AI newsletter, with over 700,000 readers staying up-to-date with the latest AI news, understanding why it matters, and learning how to apply it in their work.

Their expert research team spends all day learning what’s new in AI, then distills the most important developments into one free email every morning.

NEWSREEL
We Need to Talk About Chinese Robots…

As I explained to Premium subscribers in Tuesday’s newsletter, the notion that China is far behind the US in AI is quickly giving way to concern. Not only does the US have reasons to worry about the rapid improvement of Chinese LLMs (see Qwen 2.5 below), but China also has robots that are as agile as many animals in the open world, as proven by the following video.

These quadrupeds display uncanny agility in open terrain. They can clear obstacles almost a meter tall, descend 50-degree slopes without falling, and seemingly switch between four legs and two as the situation requires.

TheWhiteBox’s takeaway:

AI Cold War topics aside, perception and embodiment (having a body that provides sensory feedback about the world) were crucial in developing human intelligence.

Thus, robots with improved motor control could naturally lead to more competent models and, hopefully, even ‘world models’ like the ones humans possess (think of world models as what we humans call ‘common sense’) that allow us to survive in the open world.

As to whether the video is ‘faked,’ the way Tesla faked the robotic movements of its humanoids during its last presentation (they were teleoperated): given the complexity of the movements, it’s extremely unlikely these robots are teleoperated, but who knows?

That said, I think it’s wise to be at least slightly skeptical about anything you see in AI today.

FRONTIER AI
Alibaba’s Qwen 2.5 Coder 32B is Amazing

Alibaba’s new Qwen 2.5 Coder series is scary good. On Tuesday, we explained how the 7B version already beat the original GPT-4 in some benchmarks despite being hundreds of times smaller.

Now they have officially released the 32 billion parameter version, and it’s impressive, to say the least.

The model matches the performance of most state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, beating them in most coding benchmarks (although to be fair, the benchmark that really matters, the Aider Benchmark, shows Claude 3.5 Sonnet being considerably better).
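If you want to try it yourself, here’s a minimal sketch using Hugging Face transformers. It assumes the checkpoint is published as Qwen/Qwen2.5-Coder-32B-Instruct and that you have enough GPU memory; the smaller variants (like the 7B one mentioned above) follow the same pattern.

```python
# Minimal sketch: running Qwen 2.5 Coder locally with Hugging Face transformers.
# Assumes the repo name "Qwen/Qwen2.5-Coder-32B-Instruct" and enough GPU memory;
# swap in a smaller variant if the 32B model doesn't fit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```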

TheWhiteBox’s take:

Benchmarks aside, this is again a wake-up call for the West; China is no longer behind.

While we have yet to see a Chinese Large Reasoner Model (LRM) like OpenAI’s o1 models, the speed at which China has caught up should concern the US, which seems more worried about unproven existential risks than about innovating to remain the leading country in AI.

With the arrival of another administration hawkish on China, the tightening of compute exports to prevent it from accessing state-of-the-art hardware will only intensify, since the two other bottlenecks, energy and technology, are impossible to block (tech developments always leak eventually, and energy-wise China eats the US for breakfast, especially in renewables).

All things considered, this week’s news that TSMC, the largest chip manufacturer, is halting production of chips below 7nm for Chinese clients (the range of advanced chips; NVIDIA’s previous and upcoming GPU servers, Hopper and Blackwell, both use 4nm-class process nodes) further proves that this blockade against China is about to escalate.

ADOPTION
Generative AI Hype is Stalling

As reported in a study commissioned by Slack, adoption of Generative AI products is stalling. Since the last survey, the share of workers acknowledging they use AI has grown by just one percentage point, to 33%, in the US, and by four points, to 36%, globally.

While these numbers look quite good (one out of every three workers uses AI), the concern is that they aren’t growing. Importantly, 48% of workers are afraid to admit they use AI at work for fear of being called lazy or ‘cheaters.’

TheWhiteBox’s takeaway:

With revenues growing very slowly, the concern that AI is much more hype than reality is very real. However, the poor adoption and revenue numbers are usually excused on the grounds that AI is a technology still in progress, evolving so rapidly that these concerns won’t be an issue for long.

But add to that a recent report by The Information suggesting that the training run of GPT-5 (OpenAI’s next big model) has shown diminishing returns compared to previous generations (it’s still better, but the leap isn’t as large as previously expected), plus the public concern from OpenAI investor a16z that larger compute expenses aren’t leading to outsized increases in performance, and we could see a very reasonable correction in the overhyped expectations surrounding AI.

In case it’s not obvious, I’m looking forward to this correction, as it may alleviate the pressure on AI labs to continue pushing intelligence forward no matter what and, for once, start building reliable products that produce actual value.

The concern, however, is that if the narrative shifts toward product instead of this ‘AGI is near’ vision, investors will be much more skeptical about Hyperscalers each pouring $10 billion-plus into AI every quarter. In turn, this could also lead to a considerable drop in their valuations as investors become much more concerned about adoption.

SMART HOMES
Apple’s New AI Product

Apple is reportedly closing in on a March release of a smart-home, AI-powered screen display (similar to a home security panel), as per Bloomberg.

This product is allegedly a priority for Apple and showcases some cool features, like sensors that detect nearby humans and, depending on their distance, display different information.

Of course, the product will be tightly connected to other Apple devices you may have, like the iPhone, and Apple hopes most people will heavily leverage its AI, Apple Intelligence, to control the system.

TheWhiteBox’s takeaway:

It seems Apple’s commitment to becoming a household name in the smart home market (no pun intended) is clear. In addition to this product, which should retail in the $100-250 range, they are also working on more premium devices, like a robotic arm serving as a “smart home command center, videoconferencing machine and remote-controlled home security tool.”

Either way, Apple’s strategy is clear: essentiality. With AI, many industries are going to be democratized. Thus, relying on having the strongest brand in the world won’t cut it.

In my opinion, Apple is trying to immerse itself into our lives in a way only one other company, Amazon, has managed, becoming so essential and culturally present that they are impossible to kill.

And becoming the software of choice for smart homes is one hell of a way to achieve that. That said, they better get their shit together with their poor AI capabilities soon because both Amazon and Google are also looking to clinch that spot… and their AI products are much better.

BENCHMARKS
A New Unbeatable Benchmark

Epoch AI, in collaboration with prestigious AI researchers, Harvard-educated PhD mathematicians, Fields Medal winners, and even the man considered the most intelligent human alive, Terence Tao, has published FrontierMath, a set of extremely complex math problems that even some of the most brilliant people alive can’t solve alone.

In the words of the brilliant Tao, “These are extremely challenging… I think they will resist AI for several years at least.”

As you can see in the image above, Tao’s prediction may well hold, considering that our best models today solve just 2% of the tasks in the benchmark. Only Gemini 1.5 Pro and Claude 3.5 Sonnet reached 2% (probably thanks to their long context windows). For reference, none of OpenAI’s models reached over 1%.

TheWhiteBox’s takeaway:

The question here is, what will happen when AI solves this benchmark? Will that be the unequivocal metric that tells us we have achieved AGI?

To me, the key question is whether these problems can be solved through memorization. In other words, if OpenAI researchers can lay out the resolution to these tests and then feed it to the model so that the latter simply has to replicate the process, would that count as truly intelligent? In my humble opinion, that would be a hard no: a cheated benchmark, no matter the task's complexity, can never indicate intelligence.

However, if an AI solves this benchmark without leaning on past experience, that is, encountering the problems as genuinely novel situations and still solving them, that’s another story. If the benchmark is solved that way, we can be sure we are onto something that could change the world forever.

FLOOD FORECASTING
Google Releases New Flood Forecasting Model

Google is releasing a new version of its flood-prediction AI model, which has expanded its prediction capabilities to 100 countries and 700 million people.

Floods are becoming the most common natural disaster, with their pace of occurrence doubling since 2000, and they have killed thousands this year alone in places like my home country, Spain (very recently), as well as Afghanistan and Brazil, among many others.

Crucially, the scale at which Google can access data and train these models has led to models that can predict floods seven days earlier than other methods, a head start that could save thousands of lives.

TheWhiteBox’s takeaway:

Discovery. Discovery. Discovery.

It’s not productivity products like ChatGPT that actually solve real-world problems. It’s the AIs that find crucial, life-saving insights in our data that should actually matter.

As we have discussed countless times, AI’s greatest superpower is pattern recognition, finding hidden regularities in data that give us a sense of why things happen.

AI can digest an endless number of data points about floods, rainfall, humidity, temperature (illustrative examples), to find the key signals that a flood may be approaching.

For instance, these models may find relationships such as ‘when the temperature drops by x degrees and humidity rises, flood likelihood increases by y%’ (the toy sketch below illustrates the idea). Excitingly, the model is fully open-source, and you can check its predictions per river on FloodHub.
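As a purely illustrative toy (synthetic data, not Google’s model or FloodHub’s data), here’s what ‘learning a flood pattern’ from a couple of signals can look like:

```python
# Purely illustrative: a toy classifier learning flood-risk patterns from
# made-up rainfall/humidity readings. This is NOT Google's model or its data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
rainfall = rng.gamma(shape=2.0, scale=20.0, size=n)  # mm over the last 24h (synthetic)
humidity = rng.uniform(30, 100, size=n)              # % relative humidity (synthetic)
# Toy ground truth: floods become likely when heavy rain meets high humidity.
flood = ((rainfall > 60) & (humidity > 80)).astype(int)

X = np.column_stack([rainfall, humidity])
clf = LogisticRegression().fit(X, flood)

# The learned coefficients are the "patterns": how much each signal shifts flood odds.
print("coefficients (rainfall, humidity):", clf.coef_[0])
print("P(flood | 90mm rain, 95% humidity):", clf.predict_proba([[90, 95]])[0, 1])
```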

Let’s hope this research by Google incentivizes LLM-only labs such as OpenAI or Anthropic to explore discovery-based AIs that yield real value.

TREND OF THE WEEK
LLMs Know More Than They Show

What if the secret of hallucinations lies within?

A group of researchers from Apple, Technion, and Google Research has made a fascinating discovery: Large Language Models (LLMs) seem to encode a notion of ‘truth,’ and that truth doesn’t always align with what they give back. In other words, they seem to know more than they show.

This finding could have huge repercussions, not only for how these models are portrayed in society but also for new techniques that mitigate hallucinations by drawing out the ‘truth’ from within them.

Crucially, it will also introduce you to a new and unique way to fight hallucinations.

Toward The Search For Truth

Hallucinations, instances where the model outputs something inaccurate, are probably the biggest issue with LLMs today.

Uncertainty Modeling and Hallucinations

Hallucinations are attributed to many ‘illnesses,’ but they are most commonly associated with a model’s uncertainty about a prediction. LLMs assign a ‘likelihood’ probability to their output: they don’t just give you words back, they also give you, in the form of a probability, how confident they are that each word is the right continuation.

For instance, a model may output ‘London’ as a response to both “What’s the capital of England?” and “What city has the largest tea consumption?”, with likelihoods of 99.9% and 75%, respectively. Thus, it’s natural to assume the latter is more likely to be a hallucination; the model still thinks it’s London, but it isn’t quite ‘sure’ (and indeed, the second answer is wrong, as Turkey has a much larger tea consumption per capita than England).
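Here’s a minimal sketch of reading out that ‘likelihood certainty’ yourself, using GPT-2 via Hugging Face transformers purely for illustration (the actual numbers will differ from the example above; any causal LM works the same way):

```python
# Minimal sketch: reading an LLM's "likelihood certainty" for its next-word prediction.
# GPT-2 is used purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_confidence(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the next token
    probs = torch.softmax(logits, dim=-1)        # turn scores into a probability distribution
    top_prob, top_id = probs.max(dim=-1)
    return tokenizer.decode(top_id.item()), top_prob.item()

print(next_token_confidence("The capital of England is"))
print(next_token_confidence("The city with the largest tea consumption is"))
```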

While this has been the primary method for hallucination mitigation (measuring uncertainty and refusing to respond when certainty is low, as we saw in a recent Notion piece), what if the model’s output isn’t the best place to look for the model’s certainty about its responses?

An Introspective Look into Hallucinations

This group of researchers has posited the following questions: What if we are looking in the wrong place? And what if the model is hiding the truth from us?

To investigate this, they tested whether they could predict that an LLM would make a mistake based on the internal representations of the words it eventually outputs.

But what does that even mean?

To answer, we need to understand how AI models work. Say we have the input “The cat climbed the tall…” By the time you read the word ‘tall,’ the previous words have primed you to suggest that the next word might be “tree” or “table,” because both make intuitive sense as things a cat would climb. On the contrary, the word “sea” isn’t an option, because “tall sea” doesn’t make sense in the context of cats (or much sense at all).

Thus, your prediction of what word comes next is based on two things:

  • The sequence, which is referring to a ‘cat.’

  • Your internal representation of what a ‘cat’ is and what things a ‘cat’ might climb.

Well, AI models, especially Transformers like ChatGPT, do the same thing. As we saw last week, AI models work with numerical representations of words.

This way, as words become numbers, grammar, syntax, and the semantic relationships between words become mathematical operations. Importantly, the closer the vectors representing two words are, the closer their semantic relationship.
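A minimal sketch of that idea using off-the-shelf embeddings (the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are just one convenient choice; vectors from any model would do):

```python
# Minimal sketch: words as vectors, semantic closeness as a similarity between vectors.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["cat", "kitten", "tree", "sea"]
vectors = model.encode(words)

# Related words ("cat"/"kitten") should score higher than unrelated ones ("cat"/"sea").
sims = cosine_similarity(vectors)
for i, w in enumerate(words[1:], start=1):
    print(f"similarity(cat, {w}) = {sims[0, i]:.2f}")
```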


If we look at the diagram below, we see a partial visual representation of what’s going on inside an LLM regarding the first bullet: the attention mechanism LLMs perform over the inputs we give them.

To predict the next word, they take the last word in the sequence and update its meaning based on its relationship to all the previous ones. Looking at the diagram, the word ‘Of’ is no longer just a preposition expressing the relationship between a part and a whole; it also captures the meaning of the previous words in the sequence.

Intuitively, LLMs do the same thing we unconsciously do when we read a text paragraph; upon reaching the last word, they have carried the meaning of the entire sequence, giving them intuition into what word might come next after reading the last one.

In parallel, although not shown above, the model adds meaning to these words based on its knowledge of the topic, similarly to when you discarded ‘sea’ in the cat example as nonsensical compared to more reasonable words like ‘tree.’

Importantly, this process is repeated several times, gradually updating the meaning of each word in the sequence (as if the model had to reread the text several times to fully grasp its meaning).

In fact, when you compare models of very different sizes, like Llama 3.1 8B and its bigger sibling Llama 3.1 405B, the main difference is precisely the number of times the process we have just described takes place.
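To make this concrete, here’s a minimal sketch (GPT-2 via Hugging Face transformers, purely for illustration) of pulling out the model’s internal representation of the last word after each of those rounds (layers); these per-layer vectors are exactly the kind of signal we’ll care about next:

```python
# Minimal sketch: grabbing the model's internal representation of the last token
# at every layer, before the next word is actually predicted. GPT-2 for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("The cat climbed the tall", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple: the embeddings plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_dim).
last_token_states = [h[0, -1] for h in out.hidden_states]
print(f"{len(last_token_states)} representations, each a vector of size {last_token_states[0].shape[0]}")
```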

After several rounds of this understanding process, the model is ready to predict the next word. But instead of looking only at the prediction itself, the researchers looked into this previous step to see what information the model has captured and, importantly, whether it has a deeper intuition about whether its next prediction will be right or wrong.

And, fascinatingly, it does.

Finding Truth from Within

To find out whether a model encodes the ‘truth’ during this process, the researchers use a probing classifier: another model that takes the LLM’s internal representation of the next word as input and predicts whether the actual output will be correct or not.

This prober is trained on previous examples where the chosen output word was correct and examples where it was incorrect. In other words, the classifier acts as a ‘hallucination detector’ that can predict whether the upcoming model prediction will be right or wrong.

In layman’s terms, you can think of the classifier as a similar idea to a device inserted into your brain that, before you answer, measures whether you actually believe your next response is correct or not.

This means that if this classifier can predict whether the actual outcome will be right or wrong, it has found common patterns across all instances when the model’s representation actually encodes the ‘truth.’ In other words, it has found the patterns that tell us whether the model thinks that what it’s about to output is true (or not).
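To make the idea concrete, here’s a minimal sketch of such a probe. This is a hypothetical setup, not the paper’s exact method: it assumes you have already collected the LLM’s last-token internal representations for a set of questions (hidden_states.npy) plus labels saying whether each answer turned out to be correct (was_correct.npy).

```python
# Minimal sketch of a probing classifier: a simple model trained on an LLM's internal
# representations to predict whether the eventual answer will be correct.
# Hypothetical inputs: `hidden_states.npy` holds last-token vectors collected from the
# LLM on past questions; `was_correct.npy` says whether each answer was right (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden_states = np.load("hidden_states.npy")  # shape (n_examples, hidden_dim), assumed precomputed
was_correct = np.load("was_correct.npy")      # shape (n_examples,)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, was_correct, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hallucination-detector accuracy:", probe.score(X_test, y_test))

# At inference time: before showing the model's answer, check what the probe thinks.
risk = probe.predict_proba(X_test[:1])[0, 0]  # probability the answer will be wrong (class 0)
if risk > 0.5:
    print("High hallucination risk: abstain or ask the model to double-check.")
```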

But wait. Aren’t LLMs already giving us this certainty prediction when predicting the output, as they assign each word a likelihood probability as a possible continuation to the sequence?

Yes… but not quite. You see, LLMs are trained to predict the most likely next word. However, likely does not mean true.

In other words, a false output can be more likely than a true one if the false output appears more often in training. LLMs are not designed to always output the truth, but the word most likely to come next based on their experience; and since the Internet is full of falsehoods, the model will output those regardless of their ‘rightness.’

Put another way, because of this design choice in LLM training, we may be forcing the LLM to output a response it knows is wrong: the model may know the correct answer is a different one but still give you the incorrect one, because that is the more likely continuation, and that is what it was designed to produce.

That means the probabilities the model assigns to each word are a measure of ‘likelihood certainty,’ while what the classifier gives us is a measure of ‘truth certainty.’

And did it work?

In all the scenarios analyzed, the classifier could predict whether the output would be right or wrong, telling us that, in some shape or form, the LLM encodes a ‘certainty’ signal about whether it is about to be right or wrong.

Fascinating stuff!

TheWhiteBox’s takeaway

Technology:

This research is a beautiful example of how little we know about LLMs. Think about it: we are just starting to discern whether they understand concepts like ‘truth.’

Products:

Product-wise, classifiers that can predict whether the upcoming output will be correct or not could soon become a staple of well-functioning organizations using AI.

They could be used as a mitigation measure, aka ‘if the classifier says the model is about to be wrong, don’t respond,’ alleviating many of the undesirable outcomes related to hallucinations.

Markets:

Markets-wise, the more we know about hallucinations and how to prevent them, the better for an industry in dire need of meaningful adoption, which won’t happen as long as hallucinations are a thing.

THEWHITEBOX
Premium

If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You’ll even be able to ask the questions you need answers to.

Until next time!