HACKS, FINAL ROUND
Agentic Workflows & The Hack Framework

After two weeks of hacks to turn you into a Generative AI pro user, this week we cover the final and most advanced set of tips, centered on our last pending point: agentic workflows.

We have covered this topic in the past, but today we look at it in a new, more nuanced light, with updates based on the most impressive recent research, including agents that bribe other agents, fall in love, or develop a tribal sense of belonging with one another.

Moreover, we will look at a final hack that I had not explained to you until now because, well, it was discovered this very week: reflection tuning. This new method is causing massive excitement as it has just created the best model in the world (beating Claude 3.5 Sonnet and GPT-4o) and is about to break all possible records with its new version coming next week.

Finally, we will define The Hack Framework. This guideline encapsulates all our learnings over the last three weeks and is a guide you can always follow: a guide to making excellence a reproducible process.

No Agentic Workflows, No Party

When something gets Andrew Ng this excited, you must pay attention.

  • He co-founded Google Brain (now merged into Google DeepMind) and is the former Chief Scientist at Baidu.

  • He is also a Stanford professor and co-founder of Coursera, which is known for popularizing AI education. And if that wasn’t enough, he is the source of some of the best free courses on Generative AI through DeepLearning.AI.

Over the last few months, Andrew has been obsessed with something he defined as ‘agentic workflows,’ a set of four techniques to enhance a Large Language Model’s (LLM) capacity to deal with more complex tasks that require one (or more) of the following:

  1. Reflection

  2. Tool use

  3. Planning

  4. Multi-agent collaborations

And the impact of such workflows is huge.

For instance, as you can see below, when wrapping GPT-3.5 (the model behind ChatGPT’s launch back in November 2022) in an agentic workflow, its performance on HumanEval, arguably the most popular coding benchmark for LLMs, skyrockets to 95%, almost 30 points above zero-shot GPT-4, a model released six months later and still, to this day, considered a frontier AI model.

And when GPT-4 is wrapped in a multi-agent workflow (more on that later), it reaches even higher, almost topping the benchmark.

Source: Andrew Ng

And the best part is that this is not hard to build. Let’s look at every case.

Reflection

In simple terms, the model reflects on its own responses. After generating an initial output (e.g., writing code), the model is asked to evaluate it for correctness, style, and efficiency, providing constructive feedback.

The model is then prompted to use that feedback to improve its response iteratively. Repeating this process can lead to further refinements, enhancing its ability to perform tasks like coding, writing, and answering questions more effectively.

For example, you may first ask Claude 3.5 Sonnet to generate a code snippet for a given task. However, at times, that code won’t be exactly what you expected, or worse, it may not even be compilable/interpretable for actual use.

Therefore, reflection adds an extra step in which you provide the model with the original request, the previously generated code, and a new instruction along the lines of ‘Review the code you generated previously for correctness, style, and efficiency, and give constructive criticism for how to improve it.’

Source: LangGraph

Well, this seemingly unremarkable self-reflection task gives the model an uncanny boost in performance, as it can immediately identify its own mistakes, taking, as mentioned, GPT-4 from 67% to 92% accuracy on HumanEval (above image).

Importantly, self-reflection is something that you can apply immediately to your interactions with LLMs.

If you’re using the web browser version, you can engage the LLM in this style too (inside the same conversation), explicitly requesting the model to inspect, critique, and reformat/rewrite previous text, code, or images.

If you’re using the API, you can turn the model into a reflective one simply by wrapping it inside a for/while loop. Alternatively, you can use tools like LangGraph, which abstract away the complexity with methods that apply this reflection automatically, without you having to write the low-level code yourself.
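
To make this concrete, here is a minimal sketch of such a loop using OpenAI’s Python SDK; the model name, prompts, and fixed three-round budget are illustrative choices, not a prescribed recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def chat(messages):
    """One call to the chat completions endpoint; returns the text response."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

def generate_with_reflection(task, rounds=3):
    """Generate an answer, then iteratively critique it and improve it."""
    draft = chat([{"role": "user", "content": task}])
    for _ in range(rounds):
        # Ask the model to critique its own previous output.
        critique = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Review your previous answer for correctness, "
             "style, and efficiency, and give constructive criticism for improving it."},
        ])
        # Ask the model to rewrite the answer using that critique.
        draft = chat([
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": f"Rewrite your answer applying this feedback:\n{critique}"},
        ])
    return draft

print(generate_with_reflection("Write a Python function that merges two sorted lists."))
```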

Reflection doesn’t necessarily have to be on oneself (self-reflection). As we’ll see later, you can engage models in multi-agent conversations.

Next, we have tool use.

Giving LLMs The Ability to Take Action

This is an agentic workflow in which an LLM is given access to external functions to gather information, perform tasks, or manipulate data.

One prevalent example of tool use occurs when the LLM has to access data it does not know, be it because the data must be provided ‘in-context,’ as in Retrieval-Augmented Generation (RAG), or because it’s real-time data that the model couldn’t possibly know (all LLMs have a knowledge cutoff).

In both instances, the model uses an external tool to enrich its knowledge: a vector database or knowledge graph in RAG, or a web search service like the Brave or Bing APIs when asked for ‘the latest news in the US election.’

If you’re an API user, function calling means you don’t have to anticipate every possible wording users might use to request updated information. It’s the LLM itself that recognizes the need to call the web search tool autonomously (you still have to provide the tool and manage the logic, though).

Even though tool use is best leveraged through the model APIs, you can also enable it in the browser versions of ChatGPT and Gemini by creating custom assistants (named Custom GPTs and Gems, respectively) and connecting them to third-party services.

Still, it requires some knowledge of JSON and API calls, as seen below for OpenAI’s case:

Source: Author
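
For API users, here is a hedged sketch of what such a tool definition looks like with OpenAI’s chat completions endpoint; `search_web` is a hypothetical tool you would implement yourself (e.g., against the Brave or Bing APIs):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical web-search tool; the JSON schema tells the model when and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for up-to-date information beyond the knowledge cutoff.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the latest news in the US election?"}],
    tools=tools,
)

# If the model decides a search is needed, it returns a tool call instead of text;
# you then run search_web yourself and feed the result back in a follow-up message.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
```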

Coming up next, we have planning, another extremely hot technique.

Planning, The Gift That Keeps on Giving

I won’t dwell too much on planning because we actually saw it earlier in this series with the ‘break down’ hack. As mentioned, LLMs are naturally inclined to perform simple tasks.

Thus, when facing a more complex task, either breaking it down yourself or actively asking the LLM to break it down into more straightforward subtasks and executing each separately will increase performance.
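
As a rough illustration of the second approach, you can spend one call producing a plan and then execute each subtask separately. This sketch reuses the `chat` helper from the reflection example above; the prompts are just one reasonable phrasing:

```python
def plan_and_execute(task):
    """Ask the model to decompose a task, then execute each subtask in turn."""
    plan = chat([{"role": "user", "content":
        "Break the following task into a short numbered list of simple subtasks, "
        f"one per line, with no extra commentary:\n{task}"}])
    steps = [line.strip() for line in plan.splitlines() if line.strip()]

    results = []
    for step in steps:
        # Each subtask sees the results produced so far as context.
        context = "\n".join(results)
        results.append(chat([{"role": "user", "content":
            f"Overall task: {task}\nCompleted so far:\n{context}\nNow do: {step}"}]))
    return results[-1] if results else plan  # the final step's output
```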

One way of enhancing a model’s ability to plan that is gaining traction is through its combination with search-based algorithms like Monte Carlo Tree Search, which we covered in detail a few weeks ago in our review of the Agent era.

But as planning is reasonably self-explanatory, let’s move on to the most interesting of the four.

Multi-agent Frameworks

In multi-agent frameworks, the definition is in the name: we combine several agents (think of additional LLMs) and make them collaborate to increase performance.

The idea of combining multiple agents into one isn’t new. Last year, we saw the emergence of the ‘Society of Minds’ paper, an MIT piece where the researchers proposed collaborative environments between agents.

This led to pretty remarkable situations where two LLMs, both getting the question wrong initially, managed to reach the correct result through open collaboration by analyzing each other’s responses and using them to refine their own.
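
To make the pattern concrete, here is a bare-bones sketch of that debate loop, reusing the `chat` helper from the reflection example earlier; it illustrates the general idea, not the paper’s exact protocol:

```python
def multiagent_debate(question, rounds=2):
    """Two agents answer independently, then refine using each other's answers."""
    answers = [chat([{"role": "user", "content": question}]) for _ in range(2)]
    for _ in range(rounds):
        # Each agent sees the other's latest answer and updates its own.
        answers = [
            chat([{"role": "user", "content":
                f"Question: {question}\n"
                f"Another agent answered:\n{answers[1 - i]}\n"
                f"Your previous answer was:\n{answers[i]}\n"
                "Considering the other agent's reasoning, give your updated answer."}])
            for i in range(2)
        ]
    return answers
```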

Unsurprisingly, we soon saw the emergence of collaborative environment tools like Microsoft’s Autogen, which streamlines the creation of these collaborative environments between LLMs.

With Autogen, you can even create teams of agents, each with its own characteristics and duties, and put them to work for you. There is even open research on entire software development teams composed solely of individual AIs under one human’s control. An AI team for you.

Source: ChatDev paper

More recently, Together.ai presented Mixture-of-Agents, in which a set of open-source LLMs work together in an iterative self-refinement loop.

This led to results that clearly surpassed those achieved by individual frontier AI models, despite the latter being much more powerful in raw intelligence than the individual agents.

In a nutshell, agent collaboration delivers an immediate and, in most cases, substantial improvement in performance. But nothing works better as an ‘eye-opener’ about multi-agent collaboration than Altera’s very recent Project Sid.

As explained by the CEO himself, they deployed 1,000 individual AIs into a Minecraft server, leading to remarkable exploration and creation of items and some uncanny behaviors, such as:

  • The biggest tradesman in the village was, surprisingly, the Priest. How? He autonomously decided to use trades to convert people to his religion (a euphemism for bribing them).

  • They simulated realms for Trump and Kamala, each with its unique constitution. Under Trump, police increased in number, and security was always guaranteed. Under Kamala, the AIs removed the death penalty and focused on policy reform.

  • Finally, when one of the townsfolk went missing, the AIs autonomously agreed to deploy torches throughout the entire village so the light would guide the lost villager back into town. Did the AIs develop a tribal sense of companionship?

Long story short, when AIs collaborate, magic happens. By now, I assume you are convinced that AI progress isn’t only about increasing models’ raw intelligence, and that we are leaving massive power on the table by not engaging our LLMs in agentic workflows.

Knowing this, we are finally ready to see how the entire three-part series falls into place with The Hack Framework, a holistic approach to all the hacks we’ve seen that you can apply for every task you face from now on.

The Hack Framework

This framework is divided into six steps to engineer any task you face with LLMs. It is easy to follow and immediately applicable.

The framework covers what models to choose and when, how to analyze the use case, what key metrics to follow, how to prepare the prompt, whether the workflow needs to be agentic, and a final acknowledgment of testing.

Step 1. Choose your model

As discussed two weeks ago, LLMs are not ‘one size fits all,’ as incumbents would like you to believe.

Indeed, as reliable sources like the Scale AI and LMSYS leaderboards have shown, being the best at maths doesn’t make you the best at language translation.

But you shouldn’t blindly trust what benchmarks say; you should benchmark the models yourself for each task. It takes minutes to test and can determine success or failure.

Simply put, while I can recommend Claude 3.5 Sonnet for code, or DeepSeek-Coder-V2 if the programming language is lesser known, you must test and verify the choice yourself.
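
A minimal sketch of such a test, assuming the providers you compare expose OpenAI-compatible endpoints (many increasingly do); the model names, URL, key, and prompts below are placeholders, not recommendations:

```python
from openai import OpenAI

# One client per provider; judge the outputs yourself, or score them
# automatically if your task has a checkable answer.
providers = {
    "gpt-4o": OpenAI(),
    "deepseek-coder": OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY"),
}

test_prompts = [
    "Write a COBOL routine that reverses a string.",
    "Explain the bug in: for i in range(len(xs) + 1): print(xs[i])",
]

for model_name, client in providers.items():
    for prompt in test_prompts:
        out = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model_name} ---\n{out.choices[0].message.content}\n")
```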

Step 2. Analyze your use case

In step 1, you should have tried at least a few models to get a preemptive sense of which one has the highest chance of success. But now it’s time to evaluate the use case itself.

  • Step 2.a: Is it native?

Native use cases are those for which the model has been explicitly trained. As neural networks learn by trial and error, they are extremely biased toward high-frequency data (data they’ve seen often). Thus, native use cases are tasks models have seen repeatedly, maximizing their chances of working.

To check whether your task is native: if you’re using an open-source/open-weights model, the intended use cases are mentioned in its research paper or model card.

If you’re using a closed-source model (ChatGPT et al.), the labs usually include recommendations on their websites (for inspiration, click here or here for OpenAI, here for Anthropic).

You can make a use case native through fine-tuning, but that goes well beyond today’s framework, which is exclusively about prompting hacks and better, immediately applicable decision-making.

Moving on, if it’s a native use case, jump straight to step 3. Otherwise, do the following:

  • Step 2.b: Analyze your error-tolerance.

If your use case has low error tolerance (errors are expensive), simply discard LLMs, at least for now. LLMs carry an inevitable error rate due to their non-deterministic nature; non-native use cases only exacerbate it due to their low training frequency.

If the error tolerance is high, you may still apply LLMs, but the use case will automatically define your success metric.

  • Step 2.c: Automatic metric assignment.

Although we will review each metric in detail below, non-native use cases with high error tolerance are categorized as follows (a code sketch of this decision tree follows the list):

  • Back-and-forth (user/assistant) use cases requiring reasoning will force you to measure your success with an Intelligence metric.

  • Back-and-forth use cases requiring creativity will have a Diversity metric.

  • If your use case is not back-and-forth but automated or scripted, you should focus on the Reliability metric.
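
As referenced above, steps 2.a through 2.c amount to a small decision tree. A minimal sketch of that logic, with the categories above encoded as plain strings (the function name and arguments are illustrative):

```python
def pick_strategy(is_native, error_tolerance, interactive, needs_creativity):
    """Encodes steps 2.a-2.c: route a use case to its main metric (or away from LLMs)."""
    if is_native:
        return "native: jump straight to step 3"
    if error_tolerance == "low":
        return "discard LLMs, at least for now"
    if interactive:  # back-and-forth user/assistant use case
        return "Diversity metric" if needs_creativity else "Intelligence metric"
    return "Reliability metric"  # automated or scripted pipeline
```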

I can’t overstate how important this next step is. Choosing the wrong metric implies the wrong strategy, which implies certain failure. But what are these metrics, and how do we measure and improve them?

Step 3. Identifying and Maximizing the Performance of Your Main Metric

Before you ask: no, it’s not a good idea to try to maximize more than one metric, because these metrics are not orthogonal; maximizing one harms the others.
