The Agent Era: Are We There Yet or Not?

PREMIUM SUBSCRIBER NEWSLETTER ISSUE
The Agent Era Is Here… Or Is It?

Let me warn you: the incumbents have decided. The ‘next big thing’ is here.

Every LinkedIn, Medium, X, or {insert your favorite/most dreaded social media} influencer will, for the foreseeable future, endlessly generate ‘AI Agents Are Finally Here And They Are STUNNING’ content telling you that agents are the best thing that ever happened to humanity and that ‘the world is changing.’

But are they full of sh*t, or are they right for once?

To answer this, today, we are looking at the bleeding edge of AI agents, from agents that set a new record for ‘most autonomous AI software engineer’ to agents that navigate the web better than the average human, and even agents that write full scientific papers by themselves.

While some of these features are newsworthy by themselves, we have even witnessed one of these agents try to rewrite its code, an unforeseen event that has raised as many ethical and existential risk questions as anything we have seen before.

Overall, today, you will gain unique and detailed intuition into how the most powerful agents in the world work and what benefits they promise. Additionally, we will look underneath all this hype to answer the critical question:

How much will the world change over the coming months, if at all?

A New Narrative

As someone who makes a living reading AI academic research and helping people make better decisions with AI by separating truth from hype, I can tell you that the sheer scale of research and products coming our way indicates the media will soon catch the trend. The next thing you know, the world will be flooded with AI agent hype, making it really hard to separate reality from unsubstantiated claims.

But first, what is an agent?

Agents are the same models you’ve been reading about over the last two years but with the added capacity to take action.

In other words, they are Large Language Models (LLMs) or multimodal variants of them (models that also handle images, video, or audio) that not only can chat with you and take instructions, but will also act on them.

To do so, they interact with tools (other software) and perform those actions via two main methods:

  • Direct interface access. The model looks at the software’s user interface (like seeing a Chrome window), understands what it’s seeing, and interacts with it.

To process and act on the interface, the model is given access to the DOM (the page’s underlying code). If you’re familiar with Robotic Process Automation (RPA), agents interact with applications in the same way RPA robots do.

You can see a page’s DOM using Chrome’s Dev Tools

  • APIs. Some software allows you to interact with it without seeing the user interface but through a programming interface, or API.

Here, the LLM generates a structured output, usually in JSON form, and a script sends the corresponding call to the API to perform the action. The action is then executed in the third-party software without requiring the model to interact with the UI like a human would.
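To make the API pattern concrete, here is a minimal sketch of that loop in Python. The book_table endpoint, the URL, and the JSON schema are all hypothetical stand-ins, not any vendor’s actual format. The direct-interface method is the same loop, except the chosen action is applied to DOM elements instead of an endpoint.

```python
import json
import urllib.request

# Hypothetical structured output the LLM might generate for
# "book me a table for two at 8pm". The schema is illustrative only.
llm_output = '{"tool": "book_table", "args": {"party_size": 2, "time": "20:00"}}'

def call_api(tool: str, args: dict) -> dict:
    """Send the LLM-chosen action to a (hypothetical) third-party REST API."""
    req = urllib.request.Request(
        url=f"https://api.example.com/{tool}",  # placeholder endpoint
        data=json.dumps(args).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

action = json.loads(llm_output)                    # parse the structured output
result = call_api(action["tool"], action["args"])  # execute it; no UI involved
```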

So, essentially, these are the same models we have seen over the past few years, but with a growing capacity to take action. What, then, are the catalysts that have fueled this excitement around agents?

A Range of Powerful Agents

The recent announcement/release of several powerful agents has really shifted the narrative toward them.

Apple Intelligence

Starting with the most humble, we have the agents coming to Apple iOS sometime in the spring of next year. Specifically, an on-device, 3-billion-parameter model named AFM will perform various actions on your behalf, like replying to emails, summarizing text, or placing and scheduling calls.

For more complex actions, Apple will rely on a server-side model (AFM-Server), and even on ChatGPT for the most complex cases.

Notably, despite their small size (so they fit on the iPhone), these models were trained explicitly for tool use, outperforming both Gemini 1.5 Pro and GPT-4 in this facet, illustrating the first key principle: you don’t have to be huge to be a good agent.

Source: Apple

Gemini Live

As we discussed on Thursday, Google has also announced agentic features for its Google Pixel line-up (which will also be available via app for iOS and Android in general).

Similarly to Apple Intelligence, among the different features we have Call Notes, which summarizes your phone conversations, and Gemini Live, which allows a Gemini model to interact with many of your applications and take action on your behalf (similar to Siri).

This seemed a more powerful display than Apple Intelligence, but the issue is that the model isn’t on-device, which could raise serious privacy concerns.

Genie by Cosine

Another powerful entrance this week was made by Genie, a new ‘autonomous’ software engineer that beat Devin, the agent that made waves back in March for its impressive coding capabilities.

As shown in the video, Genie uses the same principle (an autonomous agent coding over a GitHub repository) but improved, clinching the top spot in the very popular SWE-Bench benchmark, which evaluates the capacity of AI agents to solve GitHub issues.

This is very impressive because it’s a very challenging task: the model needs to understand the full repository, navigate effectively between different folders and files, and write, execute, and verify code autonomously, which inevitably requires it to be capable of coding, planning, and reasoning.

Source: Cosine.sh

In their case, they did not add any fancy futuristic features like our next agent; they ‘simply’ fine-tuned an OpenAI model on data they collected by observing real software engineers solve issues.

Impressive. But here is where things get serious: the next agent we are looking at is truly worth delving into in great detail.

AgentQ, An Extremely Powerful Agent

As we saw on Thursday, MultiOn has announced a new agent, AgentQ, a jaw-dropping product that consolidates much of what I’ve been discussing in this newsletter for quite some time.

But this agent is special.

Unlike others, this one can learn from good and bad decisions, allowing it to correct itself—or ‘self-heal’ as they call it—and become the first agent that, to my knowledge, surpasses human capabilities in web navigation tasks in narrow settings.

In short, it’s the first ‘LLM+MCTS’ agent I’ve ever seen: an LLM that has been given the capacity to explore, both during training and at inference time.

But what does that mean?

From Move 37 to Web Navigation

When AlphaGo defeated the world champion, Lee Sedol, in a best-of-five match of Go, a board game that originated in China, it was a defining moment in AI history. Google DeepMind achieved it by combining deep neural networks with the Monte Carlo Tree Search (MCTS) algorithm.

In a nutshell, MCTS works by helping a model make the best decision possible in every state (here, the state would be the current board position). To do so, it performs ‘Monte Carlo simulations,’ or ‘rollouts,’ in which the algorithm looks into the future by simulating the different possible outcomes of going down a specific action trajectory.

For instance, in the Go game, AlphaGo (or AlphaZero, a better version) simulates what events would unfold for each possible action it could choose, simulating until it reaches a terminal state (in this case, winning or losing the game).

Based on that simulation, it measures the expected cumulative reward of that rollout (the likelihood that going down that path will lead to a positive outcome for the model, aka winning the game). In other words, the model chooses its next action based on the highest expected cumulative reward at any given time.

But what makes MCTS work so well is that it also incentivizes exploration.

In other words, the model doesn’t always choose the action with the highest expected reward; it also explores actions with smaller expected rewards that could, in the future, yield higher returns by unveiling unexpected new strategies.

This allows the algorithm to choose what appear to be worse actions for the sake of better long-term outcomes, instead of taking the ‘no-brainer’ option every time, which can lead to suboptimal training.
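To ground this, here is a minimal, textbook sketch of MCTS’s selection and backpropagation steps using the standard UCB1 formula; this is a generic illustration, not AlphaGo’s actual implementation. The first term favors actions with high average reward (exploitation), and the second favors actions that have been tried less often (exploration).

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # e.g., a board position
        self.parent = parent
        self.children = {}        # action -> child Node
        self.visits = 0           # how many rollouts passed through this node
        self.total_reward = 0.0   # cumulative reward from those rollouts

    def ucb1(self, c=1.41):
        """Selection score: average reward plus a bonus for rarely-tried actions."""
        if self.visits == 0:
            return float("inf")   # unvisited actions are always tried first
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select_child(node: Node) -> Node:
    """Pick the child with the best exploration/exploitation balance."""
    return max(node.children.values(), key=lambda child: child.ucb1())

def backpropagate(node: Node, reward: float) -> None:
    """Push a rollout's outcome back up the tree so future selections see it."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

Note that the exploration bonus shrinks as an action gets visited more, so moves with mediocre average rewards still get revisited occasionally, surfacing strategies that a purely greedy policy would never find.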

This exploration capacity, coupled with its rollout simulations, led to one of the greatest moments in AI history: the famous ‘move 37’. This seemingly dumb move (low immediate reward) ultimately proved crucial to winning the game against the best human at the time.

If you don’t know the rules of Go (like me), a chess comparison to ‘Move 37’ would be “The Immortal Game” between Adolf Anderssen and Kieseritzky back in 1851, when Anderssen sacrificed his queen, a seemingly stupid move, only to checkmate his rival a few moves later.

But if MCTS is so good, why isn’t every LLM using it?

The problem with MCTS, and with Reinforcement Learning techniques in general (models that learn by measuring the reward of their actions), is that the world is full of sparse rewards.

In layman’s terms, it’s hard to implement MCTS when you can’t easily measure whether the model’s action was good or not. Without going into robotics, where reward measurement has become an AI field in itself (as we covered), measuring a reward is often a matter of significant subjectivity.

For instance, given two summaries of a text (a prevalent task for an LLM), how do we choose which one is best?

This underscores the biggest problem with Reinforcement Learning today: outside constrained environments like a board game, with a clear signal of who’s right or wrong, it barely works… or does it?

Because that’s precisely what AgentQ has done in LLM web navigation.

Autonomous Web Navigation

In simple terms, AgentQ fuses the worlds of LLMs and MCTS to create an agent that can surf the web and perform actions on it. In the image below, the model books a table for its user, even correcting itself after making a mistake on the first search.

Technically speaking, the model does the following:

  • Having access to past actions and the current website state (the DOM),

  • it generates a set of possible actions (the plan),

  • it then measures the expected cumulative reward of choosing each action (the Monte Carlo rollouts we mentioned earlier), weighing whether that trajectory has been explored enough (it prioritizes trajectories that have not been thoroughly explored),

  • it executes the chosen action,

  • and, in addition to the action, it explains why it chose it, acting as an ‘inner dialogue.’ (A conceptual sketch of this loop follows below.)
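MultiOn hasn’t published AgentQ’s code, so treat the following as a conceptual sketch of that loop only: Candidate, propose_actions, and every number in them are hypothetical stand-ins for the LLM proposal step and the MCTS value estimates, not the real system.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str                 # e.g., "click #search-button"
    rationale: str            # the verbalized reason for proposing it
    value_estimate: float     # expected cumulative reward from rollouts
    exploration_bonus: float  # higher for rarely-explored trajectories

def propose_actions(dom: str, history: list[str]) -> list[Candidate]:
    """Stand-in for the LLM proposal step (a real agent would call a model here)."""
    return [
        Candidate("click #date-picker", "need to pick a date first", 0.6, 0.1),
        Candidate("type #party-size '2'", "set party size before searching", 0.5, 0.3),
    ]

def agent_step(dom: str, history: list[str]) -> Candidate:
    """One decision step: propose, score, choose, and record the 'inner dialogue'."""
    candidates = propose_actions(dom, history)
    # Choose by expected reward *plus* an exploration bonus, so less-explored
    # trajectories still get tried (the MCTS idea from the previous section).
    best = max(candidates, key=lambda c: c.value_estimate + c.exploration_bonus)
    history.append(f"Chose {best.name} because: {best.rationale}")
    return best

history: list[str] = []
chosen = agent_step("<html>...</html>", history)  # a real agent would now execute it
```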

But an unanswered question remains:

How does the model measure the reward in a scenario like this? Aren’t we in one of those cases where measuring the quality of the model’s actions is hard?

Yes, which leads us to the biggest contribution from AgentQ’s researchers: the ‘AI critic.’

Combining AI and Rollouts as a Signal

The actual process the model follows is this:

  • When tasked by the user, AgentQ first uses its LLM to propose an action plan,

  • then, it verbalizes (literally writes down) the reasoning behind the plan. In other words, it writes down why it will perform each step of the plan.

  • Next, it lays out the possible actions it can perform at a given step, sending them to the AI critic (image below),

  • The AI critic then reorders the actions according to what it considers the best potential outcomes (again, as seen in the image below)

  • Then, the model chooses an action based on a combination of the critic’s feedback and the expected reward of each rollout (the expected reward of taking a specific trajectory).

  • Finally, the model verbalizes an explanation of why it chose the action, in case it has to reason over it in future steps.

Consequently, what would otherwise be a highly complex problem with hard-to-measure rewards can be solved with an additional AI that judges the outcomes of the model’s actions with pretty solid accuracy, giving birth to AgentQ.

The AI Critic process
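To make the ‘critic + rollouts’ combination concrete, here is one minimal way such a blend could work. The 50/50 weight, the rank-to-score conversion, and all numbers below are illustrative assumptions, not AgentQ’s actual formulation, which is more involved.

```python
def choose_action(rollout_values: list[float], critic_ranks: list[int],
                  weight: float = 0.5) -> int:
    """Blend MCTS rollout values with the critic's ranking (illustrative weight)."""
    n = len(rollout_values)
    # Convert the critic's rank (0 = best) into a score in [0, 1].
    critic_scores = [(n - 1 - r) / max(n - 1, 1) for r in critic_ranks]
    combined = [weight * v + (1 - weight) * c
                for v, c in zip(rollout_values, critic_scores)]
    return combined.index(max(combined))  # index of the chosen action

# Three candidate actions: rollouts slightly prefer action 0,
# but the critic strongly prefers action 2, which wins the blend.
best = choose_action([0.7, 0.4, 0.6], critic_ranks=[1, 2, 0])  # -> 2
```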

I can’t hide my excitement about this. In this newsletter, we have been all over LLMs+search as the new, super powerful generation of AI models, so it’s nice to finally witness their inevitable emergence.

However, before we separate the hype from reality: no matter how impressive AgentQ is, it hasn’t proposed rewriting its own code, something that feels extremely dangerous, like a scene out of a doomsday movie.

But it did happen in Japan just days ago, at the hands of another agent: a mad AI scientist.
