
Eliminating Hallucinations & Robots That Imitate Us

šŸ TheTechOasis šŸ


Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

🚨 Week's Update 🚨

Welcome back! There's much to talk about this week.

Amazon has 'acquihired' Adept, an AI company focused on creating AI agents that interact seamlessly with computer interfaces, and which had raised hundreds of millions in funding.

Around 66% of the company's staff will join Amazon, although Adept will remain a standalone entity. Amazon, however, will enjoy full access to its technology and IP.

This is an identical approach to Microsoft's with Inflection: a de facto acquisition disguised to avoid antitrust complaints.

Seeing how the shadow of big tech looms over ever-larger portions of the market, I wouldn't be surprised if Lina Khan, Chairperson of the Federal Trade Commission, soon takes matters into her own hands regarding these unconventional, antitrust-proof acquisition methods.

Moving on but not leaving big tech, OpenAI has given Apple executive Phil Schiller a board seat. Thus, Apple will join Microsoft in holding non-voting seats on the board.

By the time of Queen Victoria's death, the "grandmother of Europe" had descendants on the throne (or married into it) in seven European countries, the quintessential exponent of the centuries-long incestuous tendencies of European royal families.

Well, AI is becoming much like a European royal family, with the "fab five" (Apple, Amazon, Google, NVIDIA, & Microsoft) owning a considerable portion of frontier AI labs and board seats.

In case you ever wondered why open-source is so important, bear this in mind.

Salesforce has announced, through its founder and CEO Marc Benioff, xLAM, two powerful micromodels of just 1B and 7B parameters that compete with the best models around in function-calling.

These micromodels demonstrate how well an LLM can leverage external APIs as tools to perform actions.
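To make that concrete, here is a minimal sketch of what function-calling boils down to. The `get_weather` tool and the model output are hypothetical, and the schema follows the common JSON-schema convention most function-calling models are trained against; none of this is Salesforce's actual xLAM interface.

```python
import json

# A hypothetical tool, described in the JSON-schema style that
# function-calling models typically consume (illustrative only).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# The model's only job: map a natural-language request onto a structured call.
# Given "What's the weather in Paris?", a good model emits something like:
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

call = json.loads(model_output)
assert call["name"] in {t["name"] for t in tools}  # validate before executing
print(f"Dispatching {call['name']} with {call['arguments']}")
```

Function-calling benchmarks essentially score how reliably a model produces valid, correctly-argued calls like this one.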

Salesforce, a company that was punished for slowing growth (it lost 20% of its market cap in one day), but more so for its poor AI narrative, seems really committed to jumping onto the GenAI train no matter what.

While the announcement was not shy of hyperbole and empty promises like Marc's "The age of Agentic AI is here," bear in mind that these models are state-of-the-art at one task only: function-calling.

You can feel the pressure on this company to deliver AI results soon.

With previous examples covered in this newsletter, like WonderJourney, it's clear that AI labs have a strong interest in creating models that generate 3D objects or scenes, with a potentially huge impact on the gaming industry and, although I hate to say it, the Metaverse.

Finally, I encourage you to try Moshi (it's free), a new audio model by Kyutai Labs with the smallest latency I've ever seen. The model 'thinks' and speaks simultaneously, and it can imitate accents and emotions, too.

The model is very limited in terms of 'intelligence' and can be hilariously rude. That said, the almost unnoticeable latency really gives a sense of what awaits us with AI voice agents.

šŸ§ You Should Pay Attention to šŸ§

  • 🤖 HumanPlus, Robots that Imitate Humans

  • 🤯 Lamini-1, Eliminating Hallucinations?

🤖 HumanPlus, Robots Imitating Humans 🤖

A team of Stanford researchers has proposed a full-stack method for robotics training, called HumanPlus, in which robots learn by imitating us directly.

The robots can perform several actions, including putting on and tying shoes, folding clothes, and playing the piano, many of which we had never seen an autonomous robot perform.

If scalable, this method could allow us to use the infinite data available from humans' real-world actions to create humanoids that behave just like us. And one company stands to benefit the most from this progress… and it's not a robotics company.

The Robotics Problem

Saying 'robotics is hard' is an understatement. However, one method continuously provides tangible benefits: training in simulation.

By training models in digital environments instead of training robots from scratch in the real world, costs and safety concerns drop dramatically.

With better software and training dynamics, our capacity to simulate the world has improved, enabling a surge of robots that perform extremely dexterous actions with almost no real-world training.

In particular, AI has recently been explored as a training partner that refines the reward function the robot's model uses to learn (in these settings, AIs learn by maximizing a reward based on their actions).

For instance, DrEureka and NVIDIA's pen-spinning demo are just two examples where robots trained by other AIs and in simulation perform very impressive tasks in the real world.
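For intuition, here is a toy version of that reward-driven setup. This is my own illustration, not DrEureka's code: the policy is scored on how closely it tracks a target, and in Eureka-style pipelines an LLM iteratively rewrites this scoring function.

```python
import numpy as np

# Toy reward for a tracking task: the closer the robot's state is to the
# target, the higher the reward (distances here are made-up numbers).
def reward(state: np.ndarray, target: np.ndarray) -> float:
    return float(-np.linalg.norm(state - target))

# In Eureka-style training, an AI proposes variants of reward(), each variant
# trains a policy in simulation, and the best-performing reward is kept.
state, target = np.array([0.2, 0.8]), np.array([0.0, 1.0])
print(reward(state, target))  # ~ -0.283: still some distance to close
```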

So, what did these researchers do differently this time?

HumanPlus: From Simulation to Shadowing

The first step involves training a low-level control policy using reinforcement learning in simulation environments.

This YouTube video gives great insight into how AIs are trained in simulation.

This training focuses on pose retargeting, i.e., the robot learns to imitate human target poses. This ensures the robot can accurately replicate human movements and perform highly precise tasks.

Here the researchers are already making one important change from the status quo: they are aiming for a pose-conditioned policy, a policy where the robot is trained to reach certain poses instead of just replicating specific tasks.

By focusing on imitating key poses that are common across tasks, the model builds a good movement prior that will allow it to perform several tasks, instead of becoming too fixed on certain ones.

The outcome was a low-level policy, the robot's proprioception system, that gave it good control over its 'joints and muscles.' In other words, at this stage, the model can confidently imitate several human poses.

Importantly, building a good pose prior allowed researchers to deploy the robot zero-shot into the real world, meaning that the model could, with no physical training, imitate the poses in its physical form.
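A minimal sketch of what 'pose-conditioned' means in code follows. The dimensions and the tiny architecture are assumptions for illustration, not the HumanPlus implementation:

```python
import torch
import torch.nn as nn

# Instead of mapping observations to one task's actions, the policy maps
# (proprioception, target pose) -> joint commands. Any task expressible as a
# stream of target poses is then within reach of the same network.
class PoseConditionedPolicy(nn.Module):
    def __init__(self, proprio_dim=40, pose_dim=33, action_dim=19):  # assumed sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + pose_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, proprio, target_pose):
        return self.net(torch.cat([proprio, target_pose], dim=-1))

policy = PoseConditionedPolicy()
joint_targets = policy(torch.randn(1, 40), torch.randn(1, 33))  # one control step
```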

Next, we arrive at the shadowing phase, where the system integrates the robot's proprioception (internal sensing) with real-time human pose estimation. In this phase, a human performs a series of movements that, through teleoperation, are replicated by the robot.

The key is that, while doing this, the robot's camera observes the human and captures the image features, effectively recording a 'video' of the poses it aims to imitate alongside its own body and hand data during the replicated movement.

But what does the AI model look like?

It's All The Same

If we look at the bottom diagram, we see that in both training phases (simulation, left, and shadowing, right), Transformers, just like ChatGPT, leverage the attention mechanism to find the key patterns in the input data (see the sketch after the bullets below):

  • During simulation, the Humanoid Shadowing Transformer takes in a set of proprioception data (the data of its joints and body parts) and a target pose, and is trained to match that pose by moving the different parts of its body.

  • The Humanoid Imitation Transformer takes in the images from its camera plus the current body state and, again, aims to replicate the observed poses. In other words, it figures out which movements it must make to reproduce the body positions and visual input it saw during shadowing.
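Both transformers share the same pattern. Here it is in a heavily simplified sketch, where the shapes, layer counts, and the 19-dimensional action head are my assumptions, not the paper's: tokenize the inputs, attend over them, decode actions.

```python
import torch
import torch.nn as nn

# Both HumanPlus transformers reduce to this shape: a stack of attention
# layers over input tokens (proprioception + target pose, or camera features
# + body state), followed by a head that outputs motor commands.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True),
    num_layers=4,
)
action_head = nn.Linear(128, 19)  # 19 joint targets (assumed dimension)

tokens = torch.randn(1, 16, 128)                    # 16 input tokens of width 128
actions = action_head(encoder(tokens).mean(dim=1))  # -> (1, 19) motor commands
```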

You may have noticed I'm not delving too deep into the technical specifications. The main takeaway I want you to extract from this research is that, at the end of the day, AI is always about transforming data from an input state into a desired one, and that almost all AI fields and models, from robotics to computer vision (DINO) to NLP (ChatGPT), use the same mechanism to make this transformation: attention.

In other words, on a first-principles basis, HumanPlus is actually pretty similar to ChatGPT. The 'only' things we change are the data we use and the task we want to achieve.

Just like ChatGPT takes words in and predicts the next one, HumanPlus takes its proprioception state and camera frames and produces poses.

Thus, if you understand attention, you understand current AI.
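Since that claim is doing a lot of work, here is the mechanism itself, stripped to its core. This is a textbook scaled dot-product implementation, not any particular model's code:

```python
import torch
import torch.nn.functional as F

# Scaled dot-product self-attention: every token scores its affinity with
# every other token, then updates itself with a weighted mix of their values.
def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(10, 64)                               # 10 tokens, width 64
wq, wk, wv = (torch.randn(64, 64) for _ in range(3))  # learned in practice
out = self_attention(x, wq, wk, wv)                   # same shape: 10 x 64
```

Swap the tokens for words and you have ChatGPT's core; swap them for proprioception and camera features and you have HumanPlus's.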

TheWhiteBox's take

Technology:

As Alan Turing predicted in the paper that created the field of AI, 70+ years later, AI is still the Imitation Game. All AI models do (at least today) is imitate a desired behavior, be it language (ChatGPT) or human poses (HumanPlus).

That said, HumanPlus adds a very interesting shadowing method that, while remaining imitation learning, facilitates the collection of human data, a key constraint in robotics.

However, robotics is still in the early days of making AIs imitate us well, both in physical actions and in reasoning, so please bear this in mind whenever you see the latest AI hyper claiming that 'superintelligence is near': no, it's not.

Markets:

This is bullish for… NVIDIA.

Besides providing the GPUs used to train the models, NVIDIA is also the main source of simulation environments, like Isaac Sim, so seeing both proprietary and open-source robotics converge on the idea that robots are first trained in simulation and then transferred to real life is great news for them. Especially when 78% of their revenues come from AI GPUs, they need to diversify quickly.

This is bearish for… private robotics start-ups.

Open source already has a massive deflationary effect on LLMs, and robotics could soon follow suit. Companies like NVIDIA and universities like Stanford are pushing great robotics research for free, so justifying multi-hundred-million-dollar investments in private companies like Figure.ai will become a challenge unless they present some step-function improvement we have not seen yet.

Products:

The robot used in this research was the Unitree H1, which costs less than a car ($16k). Companies like Tesla are aiming for similarly competitive prices for the Optimus robot ($20-25k).

Thus, keep in mind that robots are rapidly becoming 'affordable,' and the vision of every human owning multiple physical robots could come to fruition before the end of the decade.

🤯 Lamini-1: Eliminating Hallucinations? 🤯

Lamini.AI has announced a new method, Lamini-1, named MoME ("mommy"), that promises to reduce LLM hallucinations by up to 95%, up to ten times better than anything we have seen before.

They also claim that some of their Fortune 500 customers already benefit greatly from this method, and I predict it could create an explosion of enterprise demand.

As hallucinations are a central blocker to Generative AI adoption, MoME could be a real revolution for enterprises worldwide, and yet another reason why, contrary to what you might think, implementing GenAI processes only with ChatGPT or Claude makes little sense for an enterprise today.

But how does it work?

Breaking a Dogma in Style

As we saw a few days ago, the AI industry is packed with "known knowns," principles that are taken for granted, even when they clearly shouldn't be.

And Lamini-1's breakthrough is once again proof that nothing can be taken for granted in AI.

The Overfitting Problem

"We have created a model that is purposefully overfitted."

This statement would get you fired from almost any AI job today; overfitting is by far a researcher's or AI engineer's biggest nightmare.

An overfitted model has memorized the training data to the point that it can't generalize, i.e., it fails on data that is similar to, but not exactly the same as, the training data.

As you can see below, an overfitted model that has only seen dog photos with the dog looking at the camera will, at the slightest change in the dog's position, conclude the animal is not a dog.

Overfitting leads the model to learn overly complex patterns, preventing it from generalizing to data that is still applicable but not identical to what it saw during training.

Overfitted models are also described as having 'high variance', which Andrew Ng explains beautifully here.
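If you want to see overfitting with your own eyes, this tiny experiment (my own illustration, unrelated to Lamini's work) fits polynomials of increasing degree to ten noisy points:

```python
import numpy as np

# Fit polynomials to 10 noisy samples of sin(3x) and compare training error
# against error on held-out points.
rng = np.random.default_rng(0)
x_train, x_test = rng.uniform(-1, 1, 10), rng.uniform(-1, 1, 100)
truth = lambda x: np.sin(3 * x)
y_train = truth(x_train) + rng.normal(0, 0.1, 10)

for degree in (2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - truth(x_test)) ** 2)
    print(f"degree {degree}: train {train_err:.4f}, test {test_err:.4f}")

# The degree-9 fit threads all 10 training points (near-zero training error)
# but its held-out error explodes: that is overfitting, aka high variance.
```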

However, overfitting might be the key to solving Generative AI's biggest problem: hallucinations.

The Hallucination Problem

Picture yourself as an OpenAI researcher whose next training run will cost the company $50 million… and the model overfits. Your career is over.

Luckily, with Large Language Models (LLMs), overfitting is rare, as the models are so large that it's almost impossible for them to memorize the data fully.

For instance, LLaMa 3's main insight wasn't an architecture innovation but that you could run several epochs of the data over the model without overfitting.

But what if we wanted the model to memorize the data to avoid hallucinations?

The ideal setting, then, would be an LLM that memorizes not the entire dataset but key facts (reaching zero loss on those specific data points) while still preserving a low generalization error.

Sadly, as Lamini-1 researchers point out, for an LLM to fully memorize facts, it needs around 100 epochs of training over the dataset, which is prohibitively expensive.

According to the paper, banishing hallucinations from LLaMa 3 with standard fine-tuning would cost 350 MW of power, or $68 million at the average US industrial tariff, and would also cause massive spikes in generalization error.

Luckily, Lamini researchers have figured out a new way to overfit only key data while still generalizing like any other frontier LLM.

Millions of Experts

Just to make sure we are clear: the goal of Lamini-1 is to create an LLM that, while still able to perform various tasks like any foundation model, has a 0% hallucination risk on facts.

However, current LLMs do not discriminate among data points; they aim for a small average loss across all of them.

For that reason, their probability distribution for any next-token prediction (remember, for every predicted token/word, the model ranks all its known tokens by probability) will look like the one below in the center, where both '1981' and '1970' are 'good enough' candidates and thus enjoy high probabilities.

But facts don't work that way; facts are facts. The model should assign 100% probability to the token '1981' (and 0% to the rest), driving its loss on that token to 0.00, as it is the only valid answer.
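Here is what 'zero loss on a fact' means mechanically, with made-up numbers over a toy three-token vocabulary:

```python
import torch
import torch.nn.functional as F

# Cross-entropy is -log(p of the correct token): it only reaches 0 when the
# model puts probability ~1.0 on '1981'. "Good enough" is not good enough.
vocab = {"1970": 0, "1981": 1, "1999": 2}
target = torch.tensor([vocab["1981"]])

hedging = torch.tensor([[2.0, 2.2, 0.5]])       # '1981' merely the favorite
certain = torch.tensor([[-20.0, 20.0, -20.0]])  # '1981' near-certain

print(F.cross_entropy(hedging, target).item())  # clearly nonzero loss
print(F.cross_entropy(certain, target).item())  # ~0: the fact is memorized
```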

Consequently, since taking a pre-trained model and overfitting it on a specific dataset would inevitably provoke generalization problems, Lamini-1 proposes an augmentation instead.

Experts and Adapters

The idea is quite simple: we freeze the already-trained LLM and run several (up to 100) epochs over the data we want memorized, training millions of surrogate experts so that they, rather than the backbone, memorize the facts.

As we leave the LLM 'untouched,' we can cost-effectively train these millions of experts to the point of total memorization.
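In code, the recipe looks roughly like this. It is a conceptual sketch under my own assumptions (layer sizes, expert count, adapter shape), not Lamini's implementation:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(512, 512)      # stand-in for the pretrained LLM layers
for p in backbone.parameters():
    p.requires_grad = False         # the LLM itself stays frozen

# Small adapter "experts"; Lamini-1 scales this idea to millions of them.
experts = nn.ModuleList(nn.Linear(512, 512) for _ in range(1000))

def forward(x, expert_ids):
    h = backbone(x)                 # frozen, general-purpose computation
    # Only the selected experts are added on top; only they get gradients,
    # so running ~100 memorization epochs stays cheap.
    return h + sum(experts[i](x) for i in expert_ids)

out = forward(torch.randn(1, 512), expert_ids=[3, 42])
```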

But how are these experts chosen?

The Cross-Attention Complement

Although I highly encourage reading my blog on the attention mechanism for a full understanding, as discussed in the HumanPlus piece above, attention is the process by which a Transformer-based LLM makes sense of the input sequence.

By making words in the sequence 'talk' and update themselves with regard to the other words, the model builds a general understanding of the sequence.

Specifically, when attention is performed endogenously (between words in the same sequence), it's called self-attention. But sometimes you might want to perform attention with exogenous data; in that case, the process is called cross-attention.

For example, cross-attention is used in text-to-image models, where the text input by the user conditions the generation of a semantically related image. In that case, to embed the meaning of the text description, we use cross-attention.
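Cross-attention is a one-line change from the self-attention shown earlier: queries come from one sequence, keys and values from another. Again, a textbook sketch, not Lamini's code:

```python
import torch
import torch.nn.functional as F

# Identical math to self-attention, except the keys/values come from a
# second sequence (e.g., a text prompt, or candidate expert representations).
def cross_attention(x, context, wq, wk, wv):
    q, k, v = x @ wq, context @ wk, context @ wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x, context = torch.randn(10, 64), torch.randn(7, 64)  # two different sequences
wq, wk, wv = (torch.randn(64, 64) for _ in range(3))
out = cross_attention(x, context, wq, wk, wv)         # 10 x 64
```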


In Lamini-1, cross-attention is used to pick the relevant experts. So, all things considered, what does the full system look like?

Lamini-1

Overall, the architecture looks like this:

The process, sketched in code after the list, is as follows:

  1. An input sequence is fed into the model.

  2. While the tokens in the sequence go through self-attention, as with any ordinary LLM, cross-attention is used to select the most relevant experts for the task.

  3. The experts are added to the network, augmenting it.

  4. Once the process is finished, the outputs of steps 2 and 3 are merged, giving us the new state of the sequence.
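Tying the four steps together, here is a conceptual sketch of one such layer. It is my own simplification: the expert count, the selection rule, and the sizes are assumptions, not Lamini's published code.

```python
import torch
import torch.nn as nn

d = 64
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(100))  # scale to millions
expert_keys = torch.randn(100, d)   # one learned key per expert

def mome_layer(tokens, k=2):
    h, _ = attn(tokens, tokens, tokens)               # step 2: self-attention
    query = tokens.mean(dim=1)                        # summarize the sequence
    top = (query @ expert_keys.T).topk(k).indices[0]  # pick relevant experts
    h_experts = sum(experts[i](tokens) for i in top.tolist())  # step 3: augment
    return h + h_experts                              # step 4: merge both paths

out = mome_layer(torch.randn(1, 12, d))               # step 1: input sequence
```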

But what is the real effect of this architecture? If we look at the image below, the resulting loss curve is the pink one, as the following is taking place:

  1. The LLM with self-attention ensures that the error remains low no matter the input task.

  2. On the other hand, if the input task is fact-based, the experts guarantee that the error drops to zero for that particular prediction, ensuring that the model rarely, if ever, hallucinates.

With this, you get a model that, while still able to generalize and perform a multitude of tasks robustly (low general error), doesn't shit the bed whenever it has to output a specific fact.

TheWhiteBox's take:

Technology:

If Lamini's claims prove correct, their achievement is huge: finding a cost-effective way to train models for very many epochs to ensure memorization while keeping the foundation LLM's generalization capabilities untouched to ensure applicability.

Markets:

This is bullish for… enterprises that embrace open-source.

I've already shown you several times that AI's missing piece is enterprise adoption. Hallucinations are always the main discussion point that throws projects overboard, something Lamini-1 might be ready to change.

This is bad news for… private LLM companies.

Unless they allow for very efficient fine-tuning, which won't happen anytime soon, the labs behind ChatGPT, Claude, or Gemini will soon watch most enterprises opt for open solutions, which they can then fine-tune with methods like Lamini-1 and obtain better results than what ChatGPT has to offer. Given that, I envision an enterprise adoption strategy as follows:

  • Open-source models as fact retrievers and for "chatting with data" use cases.

  • Private-source models for copilots.

Products:

This is very bullish for enterprise LLM software providers like Lamini, Databricks, Snowflake, Predibase, or NVIDIA (through its NIM services), which may offer cost-effective, even serverless, capabilities to fine-tune LLMs for every single task and generate the adoption everyone has been waiting for.

šŸ§ Closing Thoughts šŸ§

The revolution of small language models is the key theme in this week's issue.

Microsoft, Lamini, Google, Apple, Salesforce, and so forth are all unapologetically focusing on SLMs, as they realize that general demand will come through these models way before it comes from frontier LLMs.

In the meantime, robotics continues to improve, and the promise of tangible applications within a few years is almost priced in at this point. Once again, the incumbents are positioned for triumph: NVIDIA, through its research and simulation software, and the hyperscalers, through huge investments in companies like Figure.ai, 1X, or Field AI.

And if that wasn't enough, big tech is disguising its tightening grip on the industry through clever antitrust-resistant acquisitions, ensuring that no matter the outcome… it will win.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]