
Should AIs Kill? OpenAI's Swarm, & The Future of AI Training


For business inquiries, reach out to me at [email protected]

THEWHITEBOX
TLDR;

  • OpenAI's Swarm, The Future of Agents?

  • Amazon's AI Shopping Assistants… & Coders

  • Sequoia Capital Shares Its Updated AI Vision Again

  • The King of Small Models is Back

  • Should AIs Kill?

  • [TREND OF THE WEEK] Intellect-1, The Future of AI Training?


PREMIUM CONTENT
New Premium Content

This week I've had some personal issues, which is why I haven't written as much new content as usual. Next week should be back to normal. I appreciate your understanding.

NEWSREEL
OpenAI's Swarm, The Future of Agents?

OpenAI has released a GitHub repository based on an insider piece published a few days ago. The idea is to streamline the creation of multi-agent scenarios where humans manage a handful of agents collaborating with each other autonomously.

They mention that it's not meant for production, but let's not forget that OpenAI is expected to release agent products in 2025. The idea is worth explaining because it provides insight into OpenAI's plans and the risk it could represent to the future of many start-ups in the AI dev space.

In particular, with Swarm, OpenAI introduces the ideas of routines and handoffs.

A routine is a guide the agent has to follow, written in natural language, for instance, for a customer support ticket (questions for context, refunds, escalation to humans, etc.). Ideally, a single agent would manage the entire routine, but agents tend to make mistakes on long routines.

Therefore, OpenAI uses handoffs, giving agents the power to delegate tasks to more specialized agents. To do so, they propose using agent transfers as tools that an agent can call.

Tool calling is a fundamental part of AI agents. I dive deep into the idea here.

For instance, a 'refund agent' may be in charge of executing the refund function, but may also need to return the conversation to the context agent to finalize the interaction. Thus, the refund agent has another tool called 'transfer_to_finalize', which, in reality, is an instance of another agent that receives the context and continues the interaction. The link at the top shows a complete example by an OpenAI employee.
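To make the idea concrete, here is a minimal sketch in the spirit of the repo's examples (the agent names, instructions, and refund function are made up for illustration; check the repository for the exact API):

```python
from swarm import Swarm, Agent

def execute_refund(item_id: str):
    # Illustrative tool the refund agent can call to actually issue the refund.
    return f"Refund issued for {item_id}"

def transfer_to_finalize():
    # The handoff is just another tool: returning an Agent hands the conversation over to it.
    return finalize_agent

finalize_agent = Agent(
    name="Finalize Agent",
    instructions="Wrap up the interaction and confirm next steps with the user.",
)

refund_agent = Agent(
    name="Refund Agent",
    instructions="Follow the refund routine, then hand the conversation back to be finalized.",
    functions=[execute_refund, transfer_to_finalize],
)

client = Swarm()
response = client.run(
    agent=refund_agent,
    messages=[{"role": "user", "content": "I want a refund for order 123."}],
)
print(response.messages[-1]["content"])
```

The key design choice is that delegation is modeled as an ordinary tool call, so the LLM decides when to hand off the same way it decides when to call any other function.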

TheWhiteBox's take:

Multi-agent frameworks are starting to look like natural evolutions of the status quo. While we have yet to find a genuinely generalizable agent (an AI that performs strongly across all tasks), combining task-specific agents through seamless interactions is a no-brainer, and OpenAI's intentions are unequivocally heading in that direction.

This also gives a clearer idea of the role LLMs will play in the future as a natural language interface between humans and machines. As Swarm shows, humans give instructions in natural language, and agents understand these and autonomously act on them.

At some point in 2025, humans might finally have a simple-to-use product with which they can create dozens of different agents to automate a large part of their daily work.

Itā€™s not AGI, but itā€™s undoubtedly transformational.

AMAZON
Amazon's AI Shopping Guides… & Unexpected Coders

Amazon launched a feature, AI Shopping Guides, that lets users find their ideal product much faster.

It helps you reduce the time spent researching before you make a purchase by proactively consolidating the key information you need alongside a relevant selection of products. This makes finding the right product for your needs quicker and easier.

This includes Rufus, a shopping assistant branded as a dog (with a really funny side story you'll see below) that may even be capable of checking a product's price history.

TheWhiteBox's take:

Funnily enough, some users have found 'other uses' for Rufus, even using it to code. As discovered by users and echoed by the popular YouTuber Atrioc, you can make Rufus comply… with almost any request, despite it allegedly being focused on shopping topics.

As shown in the image below, as long as the first request is related to shopping, Rufus stops refusing afterward, basically complying with unrelated requests.

Source: Atrioc

This proves how brittle our current alignment methods are in protecting chatbots from deviating from a given 'script.'

If Amazon canā€™t do it, who can?

MARKETS
Sequoia Shares Its Updated AI Vision

Source: Sequoia Capital

Sequoia Capital, arguably the largest venture capital firm in the world and investor in companies like OpenAI, has shared its vision on the current state of AI.

It offers a profound reflection (straightforward to follow nonetheless) on 'where we are in AI,' covering topics like the transition to "System 2 thinking" that o1 models imply (covered in greater detail in my newsletter here), the shift from scaling models toward larger inference-time compute (models 'think' on a task for longer), and where they think the great investment opportunities lie.

Interestingly, they seem to have changed their minds on SaaS companies' future. Initially, they thought these companies would be just fine. Since then, they seem less sure, sharing feelings similar to mine, a topic I recently covered myself.

TheWhiteBox's take:

I always recommend Sequoia's blog. Their posts are easy to follow, and they don't mind sharing the rationale behind their investments, which is good food for thought. However, let's not forget they are investors in OpenAI, so the hype around AI is palpable.

For instance, reading the blog might make one think that 'reasoning has been solved.' This couldn't be further from the truth: additional breakthroughs are required to take AI to human-level reasoning (not just scaling o1, as the blog seems to suggest), as even o1 still lacks strong inductive, deductive, or, above all, abductive reasoning capabilities. This Sunday, I'll go into much greater detail on why that's the case.

Furthermore, their bias toward their investments is also evident in the image above, where they entirely ignore model providers like Mistral and Cohere and hybrid-architecture proponents like AI21, which could play critical roles, especially outside the US (and that's without mentioning Chinese labs).

EDGE AI
The King of Small Models is Back

Out of all the frontier AI labs, Mistral has been leading the way in the small model range (sub-10 billion LLMs) for over a year since the release of the landmark Mistral 7B. And to celebrate its anniversary, they have released two new models: Ministral 8B and a smaller Ministral 3B.

Both models score industry-leading results in many popular benchmarks (shown above), such as coding, maths, or function calling (using external functions).

TheWhiteBox's take:

Edge AI is an often overlooked yet fundamental piece of the AI puzzle. The ability to run LLMs on local devices without requiring Internet access is a crucial security feature that makes these models the optimal solution for companies (as long as performance is adequate).

Although LLMs carry Internet-scale knowledge in their weights, we only need Internet access to talk to them because the models are hosted on cloud systems. But if you install the LLM files on your computer or on your in-house GPU cluster, you don't need Internet access (unless you want the model to call web APIs to retrieve recent knowledge).

Still, these models usually underperform large models; scale is unequivocally proportional to performance. However, Mistral puts a lot of emphasis on these models' 'tunable' nature. In other words, they expect you to fine-tune these models to your particular tasks, i.e., retrain them on your task-specific data, following a similar approach to the one I shared with my Premium subscribers above.
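For a rough idea of what that fine-tuning could look like, here is a sketch using Hugging Face's peft and trl libraries (the checkpoint name, dataset file, and hyperparameters are placeholders, and argument names vary across library versions; this is not Mistral's official recipe):

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Placeholder checkpoint name; substitute the actual Ministral weights you have access to.
model = AutoModelForCausalLM.from_pretrained("mistralai/Ministral-8B-Instruct-2410")

# Your task-specific data: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="my_task_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    # LoRA keeps most weights frozen, so the retraining fits on modest hardware.
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="ministral-8b-finetuned", num_train_epochs=1),
)
trainer.train()
```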

WAR & ETHICS
Should AIs Kill?

Silicon Valley loves talking about the future. Now, it is immersed in a valley-wide discussion on whether AI should be allowed to decide to kill or whether a human should ultimately give the order.

And before you dismiss this as small talk, itā€™s a crucial debate on the future of AI warfare.

This conversation has several nuances that must be addressed. For starters, we don't know how these models work internally. In fact, there's an entire AI field focused on this, Mechanistic Interpretability, which we covered a while back in this newsletter (it's a fascinating read, but long story short, we still know very little about the 'insides' of LLMs). This makes AIs unpredictable, which makes them a bad choice on first principles (although still very convenient).

Also, if the AI shits the bed and kills a civilian, who's at fault? The Commander? The Contractor, like Anduril? Or OpenAI for developing the AI? These questions become particularly relevant should AI face moral dilemmas.

A fascinating way to observe this for yourself is MIT's Moral Machine, where you are presented with dilemmas designed to force you into tough decisions, including killing someone to save three others.

For instance, if you're a Tesla AI engineer, what should you train Tesla's Full Self-Driving software to do in the following scenario: kill the driver or kill the runner?

Source: MIT

At the end of the day, we must ask ourselves: Do we want AI to play God? Decide who lives or who dies, all for the sake of efficiency and accuracy?

TREND OF THE WEEK
Intellect-1, The Future of AI Training?

Is decentralized AI training becoming a reality?

While Big Tech pours $50 billion a quarter into building AI data centers and buying or building nuclear power plants to train the frontier AI models of the future, a group of researchers is trying to make this effort, one expected to be one of the most lucrative business models in history, accessible to everyone, including you and me.

In other words, we could one day also participate with our data and compute (and get paid for it) to build the next frontier AI model, even if we are at opposite ends of the world.

To achieve this vision, they are training Intellect-1, the first decentralized-trained 10-billion-parameter LLM. In fact, training is underway as you read this newsletter.

Now, after achieving the first milestone of training a one billion-parameter model this way, the team is ready to make history again with a model ten times the previous size.

But why is this important? It's not only important; it's a fundamental development that could change the course of the industry and the future of AI economics, obviously impacting Big Tech's plans.

With todayā€™s trend of the week, youā€™ll learn:

  • the basics of foundation model training in a jargon-free, easy-to-follow manner,

  • the challenges Big Tech is facing to scale model sizes,

  • and also how the worlds of AI and blockchain could one day represent the ultimate form factor for you and me to reap the economic benefits of AI.

So, What is Decentralized Training?

Not to be confused with distributed training (training distributed between at least 2 GPUs, but done over thousands for frontier AI training), decentralized training refers to techniques in which a group of independent players get together to train an AI model instead of one player, like Microsoft or Google, performing the entire training.

In other words, distributed training divides training between two or more GPUs, ideally close to each other, while decentralized training involves several actors instead of just one (and, by definition, is also distributed).

But what do we mean by ā€˜trainingā€™ in the first place? How do we train a model like ChatGPT?

Weights and Gradients

When we talk about 'AI models,' we mostly refer to neural networks. These are groups of variables, called neurons, that, combined, help approximate the real relationship between an input and an output.

  • ChatGPT maps a set of words in a sequence into the next word of that sequence.

  • Image generators map an image request ('draw me x') into the actual image that represents that request.

  • As we saw on Tuesday's Premium news report, scientists have created an 'AI tongue' that maps brain signals in the gustatory cortex to liquids, meaning the tongue can differentiate Coke from Pepsi (yes, robots now have taste buds).

  • Neuralink is mapping brain signals into cursor actions so that a paraplegic can interact with computers using his/her mind.

And the list goes on. And while the use cases vary tremendously, the core principle is the same: a neural network that learns the 'mapping,' or causal relationship, between two (or more) variables.

Simply put, AI models are functions that learn the relationship 'when x happens, y happens' to predict new 'y's from new 'x's. If it sounds boring, that means I'm doing a good job explaining it; on first principles, AI is far simpler and less amusing than incumbents would have you believe.

It's not magic; it's statistics on steroids.

And how do these models learn? They learn an approximation of the real relationship between inputs and outputs by fitting a 'parametric curve.' In other words, neural networks represent a huge mathematical equation with billions of variables that we can 'adapt' to predict the desired output. And this adaptation is what we call 'learning.'

On a first-principles basis, it's not much different from fitting a straight line, y = mx + c, through two points (x1, y1) and (x2, y2); by finding the constants 'm' and 'c', we find the line that connects both points. Neural networks are the same idea, but with billions of variables instead of just two.
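As a toy illustration of that first-principles view (the two points are made up), fitting the line is just solving for the two constants and then reusing them to predict:

```python
# Fit y = m*x + c through two made-up points: (1, 3) and (4, 9).
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

m = (y2 - y1) / (x2 - x1)  # slope: 2.0
c = y1 - m * x1            # intercept: 1.0

# The "trained" model now predicts new y's from new x's.
predict = lambda x: m * x + c
print(predict(10.0))  # 21.0
```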

To make this adaptation (training), we need two things:

  1. A global loss signal: how well the AI model performs in its predictions.

  2. A local gradient: for each learnable (adaptable) parameter in our model, compute whether it's affecting the loss positively or negatively, and by how much.

Regarding the former, we always define a learning objective (in ChatGPT's case, the probability it assigns to the correct next word), which shows how well a model performs at every training step.

For the latter, we compute the gradient, or rate of change, of the loss with respect to every parameter for every prediction; once we have computed the loss of a prediction (how well the model performed), we can trace that loss signal back through the network (a process known as backpropagation) and calculate the impact of each parameter.

In layman's terms, we can trace the effect of each parameter on the loss. If the gradient is positive, that means larger values of that parameter will lead to greater losses, and vice versa. And because we are computing the gradient (the derivative) of that variable, we also know the 'rate of change,' or how much or how little that variable is affecting the loss.

Knowing this, we can update every variable to decrease its impact on the loss. Over time, this decreases the loss, meaning the model is learning.
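Here is a minimal sketch of that loss-gradient-update loop on the same toy line model from before (the data and learning rate are made up; a real network does exactly this with billions of parameters, with backpropagation supplying the gradients):

```python
import numpy as np

# Toy data generated by the "true" relationship y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, c = 0.0, 0.0  # learnable parameters, starting from a blank guess
lr = 0.01        # learning rate: how big each update step is

for step in range(2000):
    y_pred = m * x + c                      # forward pass: make predictions
    loss = np.mean((y_pred - y) ** 2)       # global loss signal: mean squared error
    grad_m = np.mean(2 * (y_pred - y) * x)  # local gradient of the loss w.r.t. m
    grad_c = np.mean(2 * (y_pred - y))      # local gradient of the loss w.r.t. c
    m -= lr * grad_m                        # nudge each parameter against its gradient
    c -= lr * grad_c

print(m, c)  # approaches 2.0 and 1.0 as the loss shrinks
```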

This exact process is identical for every neural network, meaning that between ChatGPT's next-word prediction capabilities and Neuralink's 'moving a screen cursor with the mind,' we only change the data and the learning objective; all other things remain the same.

Even the architecture is fairly similar, as Transformers are extensively used in all areas of AI today.

Now that we know how a neural net learns in theory, let's analyze how they learn in practice, one of the toughest engineering problems in the world right now.

The Greatest Headache

As we saw last Sunday, LLMs are very large, meaning they must be broken into parts and distributed across several GPUs (thousands in some cases).

To make matters worse, not only do you have to split one replica (one instance of the model being trained) across multiple GPUs, but if the amount of compute and training data is too large, you may need several replicas, each with its own group of GPUs, training in unison, which is precisely the case for models like Llama 3.1, GPT-4, Gemini Ultra, and so on.

For instance, Llama 3.1 405B was trained on 24,000 GPUs, with a replica of the model every 16 GPUs. Every 16-GPU pod updates the parameters of its replica and, every once in a while, shares its learnings with the other pods (roughly 1,500 pods in total).

As you can appreciate, this is a very complicated task that requires continuous communication between pods and sharing a lot of information between them. It also requires pods to be collocated (very close to each other), even in the same building, if possible.

All things considered, for all this to work well, it requires:

  • Super expensive high-bandwidth cables between GPU clusters, forcing GPUs to be closer together.

  • Synchronous updates. Until all pods have the exact same new model state, they must wait for any lagging pods, which is the main bottleneck in LLM training. Simply put, a single GPU failure could delay the entire run in a potential 100,000-GPU training cluster like Elon's Colossus. To make matters worse, this is not uncommon, as we saw in Meta's research on Llama 3.1, where it represented 30% of interruptions (page 13), delaying training to three times the theoretical value (from 17 days to 54).

In that state, it's impossible to train models across continents because the constant communication leads to endless training runs. So, how does Intellect-1 aim to solve this issue? They leverage the OpenDiLoCo distributed training algorithm and the Prime decentralized training framework, both heavily influenced by Google's DiLoCo.

So, in a nutshell, how is Intellect-1 being trained?

The Intellect-1 Approach

Specifically, the process is as follows:

  • Inner loop: GPUs are divided into pods, each receiving a chunk of the data and a replica of the model (this is the same as before). The pod processes the batch and computes the gradients, updating the replica's parameters. This loop requires GPUs in a pod (node) to be closely interconnected, as we will perform hundreds of sequential updates to the model parameters. Eventually, all GPUs in a pod share their updates and output a general value for the entire pod.

  • Outer loop: After a set of inner-loop updates, each pod shares the new values of its parameters with the other pods in the global GPU network. Each pod receives the other pods' updates and averages them, so the actual new value of the parameters is the average of the values obtained from all pods. This allows all pods to start from the same model state for the next global training step.

In other words, instead of having pods communicating continuously, we train models using a low-communication approach.

Different pods, which may be located thousands of miles apart, can perform extensive local training (Intellect-1 is being trained with 500 local steps for every global update) before communicating their learnings to the other pods. This ensures that the greatest bottleneck, global communication, occurs as little as possible.
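Here is a heavily simplified sketch of that two-loop structure, reusing the toy line model from above as a stand-in for each pod's replica (the real Prime/OpenDiLoCo stack adds further machinery, such as a smarter outer optimizer, quantized communication, and fault tolerance, all omitted here):

```python
import numpy as np

def local_sgd_step(params, x, y, lr=0.05):
    # One inner-loop update of a toy linear model y = m*x + c on a pod's local data.
    m, c = params
    y_pred = m * x + c
    grad_m = np.mean(2 * (y_pred - y) * x)
    grad_c = np.mean(2 * (y_pred - y))
    return np.array([m - lr * grad_m, c - lr * grad_c])

def decentralized_training(global_params, pods, global_steps=10, local_steps=500):
    for _ in range(global_steps):
        replicas = []
        # Inner loop: each pod copies the current global model and trains it
        # on its own data shard for many steps, with no outside communication.
        for x, y in pods:
            params = global_params.copy()
            for _ in range(local_steps):
                params = local_sgd_step(params, x, y)
            replicas.append(params)
        # Outer loop: pods exchange their parameters and average them, so every
        # pod starts the next round from the same model state.
        global_params = np.mean(replicas, axis=0)
    return global_params

# Two "pods", each holding a different shard of data from the same y = 2x + 1 relationship.
xs = np.linspace(0.0, 4.0, 20)
pods = [(xs[:10], 2 * xs[:10] + 1), (xs[10:], 2 * xs[10:] + 1)]
print(decentralized_training(np.zeros(2), pods))  # approaches [2.0, 1.0]
```

The point of the sketch is the communication pattern: hundreds of local updates for every single global exchange, which is what makes training across continents tolerable.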

Another exciting feature is that GPUs can come online and go offline without affecting global training. In other words, you can participate in the training if you want to and have GPUs available right now.

Lots At Stake for AIā€¦ And Crypto

If successful, Intellect-1 may pave the way for larger decentralized AI training runs that could one day become the norm. The engineering feat would be so impressive that, to me, it wouldn't be engineering anymore; it's basically art.

Of course, you may be wondering, what's the role of blockchains here? While the Crypto industry is very scammy, the technology has a clear and valuable raison d'être: economically incentivizing decentralization.

In distributed settings, the more decentralized a network is, the safer it is. Thus, through compute power (proof-of-work, like Bitcoin) or asset staking (proof-of-stake, like Ethereum), the distributed nodes in the network secure it by actively validating transactions occurring in it. This builds a ledger that, if truly decentralized (which is rarely the case in Crypto), is almost impossible to tamper with.

This ledger (transaction registry) is what we call the blockchain. And why do cryptocurrencies exist? Simple: to economically incentivize the nodes in the network to participate in the security and get paid for it.

In a nutshell, if decentralized training becomes a thing, blockchains will be the tool in charge of economically rewarding participants for their input, be that data or compute.

Overall, we have just skimmed the surface of the underlying complexity of training a huge AI model in a decentralized manner, including communication quantization, custom code abstractions, and other features that would have made this newsletter 10k words long.

But I hope you at least now have a nice intuition of how LLMs are trained, and of the importance and impact decentralized AI training could have on the industry and the economy (yours included).

THEWHITEBOX
Premium

If you like this content, consider joining Premium: you will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.

Until next time!