Trump & AI, Real-Time Worlds, GenAI Corporate Spending, & More

In partnership with

For business inquiries, reach out to me at [email protected]

THEWHITEBOX
TLDR;

  • 🇺🇸 Trump’s Election Win & AI

  • 🤩 Meta’s Monstrous Llama 4 Training Cluster

  • 🥸 GenAI Corporate Spending

  • 😎 Google’s ‘Big Sleep’ Finds Serious Bug

  • 📰 Other news from Microsoft & Anthropic

  • [TREND OF THE WEEK] Oasis, Creating Worlds in Real Time

Who really owns your audience?

Being a Creator has never been easy, but unpredictable algorithms make connecting with your audience on social media harder than ever.

Enter beehiiv, the newsletter platform used to send this very email.

beehiiv frees you from the algorithms, giving you the tools to connect and create a more direct relationship with your followers.

Plus, with a network of premium advertisers and paid subscription options, you can tap into new revenue streams from day one.

NEWSREEL
Trump’s Election Win & AI

As you may know by now, Trump earned a decisive victory in this week’s US Presidential election. While I’m not from the US, let alone a political expert, this win impacts AI heavily, as media like The Times have called this election the ‘first AI election.’

While the real effect the Trump administration will have on AI remains to be seen, we can already predict several outcomes:

  • We should see a complete repeal of Joe Biden’s AI Executive Order, which forced companies training models that surpass GPT-4’s compute budget, around 2.1×10²⁵ FLOP (mathematical operations), or 21 trillion trillion operations, to provide the government with detailed information on the training run.

If that feels like a lot of compute, it is… but nothing compared to what’s coming. According to EpochAI, we could see models trained with 200 quadrillion trillion (2×10²⁹) operations by the decade's end.

To put that number into perspective, scientists estimate Earth has around 7.5 quintillion grains of sand, or 7.5×10¹⁸. That theoretical model would be trained with compute equivalent to the number of sand grains on… roughly 27 billion Earths (a quick back-of-the-envelope check follows this list). That’s a lot of sand… I mean compute!

  • Considering that one of Trump’s biggest campaign slogans was ‘America First,’ we should see severe deregulation in energy generation and transmission capacity requirements. In other words, we will see a massive uptick in data center build-outs inside the US. Trump definitely wants to see AI models not only designed in the US but trained and served inside its borders.

  • Discontinuation of the CHIPS Act, a set of hefty subsidies aimed especially at US chip manufacturers like Intel to revamp their decaying businesses. Trump called this act “very expensive,” so we could expect more deregulation and tax credits (in line with point two) than subsidies.
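Circling back to the compute figures in the first bullet, here is a quick back-of-the-envelope check of those numbers. Everything below is just the estimates quoted above, not official data:

```python
# Back-of-the-envelope check of the compute figures above.
# All inputs are the estimates quoted in the text, not official figures.
gpt4_flop = 2.1e25              # GPT-4's estimated training compute (FLOP)
epoch_projection = 2e29         # EpochAI's projected training run by decade's end
sand_grains_on_earth = 7.5e18   # common estimate of Earth's sand grains

print(epoch_projection / gpt4_flop)             # ~9,500x GPT-4's training budget
print(epoch_projection / sand_grains_on_earth)  # ~2.7e10, i.e., ~27 billion 'Earths of sand'
```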

But the biggest ‘if’ comes from open source. Trump is notoriously anti-China. Therefore, if AI models become a matter of national security, Trump could enforce much tighter scrutiny on open source to prevent it from falling into China's hands. And reports like the ones coming from Meta won’t help.

But this is such a crucial topic that we will devote this Sunday’s newsletter issue to delving into the Trump era's winners and losers.

LLM TRAINING
Meta’s Monstrous Llama 4 Training Cluster

In Meta’s earnings call, Mark Zuckerberg mentioned that they were training Llama 4 “on a cluster that is bigger than 100,000 H100 AI GPUs, or bigger than anything that I’ve seen reported for what others are doing.”

He then mentioned Llama 4 would have new modalities, stronger reasoning, and be much faster. That’s a big claim. The previous generation, the Llama 3.1 models, was trained on a 24k NVIDIA H100 cluster, roughly a quarter the size of the one being used for the new version.

TheWhiteBox’s takeaway:

Let’s put these numbers into perspective. The largest known AI training cluster in the world is xAI’s Colossus data center in Memphis, Tennessee, with around 100,000 H100 GPUs, a fact acknowledged not only by Elon but also by NVIDIA.

That means Zuck’s claim suggests Llama 4 is being trained across several data centers with a combined capacity of more than 100,000 H100 GPUs. If they pull it off, this would undeniably become the most extensive known distributed training run ever, as we have no public information on what OpenAI, Anthropic, Google, and xAI are doing with GPT-5, Claude 4, Gemini 2.0, and Grok-3, respectively.

As mentioned, this cluster is four times bigger than the one they used to train Llama 3.1, but does that mean Llama 4 will be four times larger? No. A while back, Zuck mentioned that Llama 4 would require 10x more compute, putting its training budget at around 4×10²⁶ FLOP, or roughly twenty times GPT-4’s budget.

While that could suggest a model ten times larger, what’s presumably going on is that the training dataset has grown tremendously (data size matters more than model size, which grows mainly to accommodate the increase in data). Therefore, a more reasonable estimate for Llama 4’s size would be around 2 or 3 times Llama 3.1 405B, so just over a trillion parameters (see the rough sketch below).
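To see why 10x the compute doesn’t translate into a 10x larger model, here is a rough, hypothetical sketch using the standard ‘compute ≈ 6 × parameters × training tokens’ rule of thumb and assuming the extra compute is split roughly evenly between model size and data. The Llama 3.1 figures are Meta’s public numbers; the even split is an assumption:

```python
import math

# Rough scaling sketch: training compute C ≈ 6 * N (params) * D (training tokens).
llama31_params = 405e9          # Llama 3.1 405B
llama31_tokens = 15e12          # ~15T training tokens (Meta's reported figure)
llama31_compute = 6 * llama31_params * llama31_tokens   # ≈ 3.6e25 FLOP

llama4_compute = 10 * llama31_compute                   # Zuck's "10x more compute"

# If the extra compute is split evenly between model size and data,
# both grow by sqrt(10) ≈ 3.2x rather than 10x.
growth = math.sqrt(10)
print(f"Llama 4 compute: {llama4_compute:.1e} FLOP")                       # ≈ 3.6e26
print(f"Implied size: {llama31_params * growth / 1e12:.1f}T parameters")   # ≈ 1.3T
```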

AI ADOPTION
Generative AI Corporate Spending

The Information has released a comprehensive list of Generative AI corporate spending across 50 companies, including Coca-Cola, PwC, and Walmart.

TheWhiteBox’s takeaway:

While some companies invest tens of millions of dollars a year, such as Walmart or Intuit, these numbers are still laughably small compared to the hype surrounding these products.

Take Walmart, for example. Its GenAI spending of around $24 million represents 0.0036% of its total revenues of $665 billion. While the list of companies using these products is extensive, signaling some level of adoption, the commitments are still very small.

If we move into the application layer, with examples like Microsoft 365 Copilot, we do see adoption by many relevant clients (BlackRock >$6 million, EY >$48 million, Goldman Sachs >$2 million), but it still feels abnormally small for a technology that has single-handedly sustained the US stock market for the last two years by adding trillions of dollars in market value to Big Tech companies, which are all suddenly ‘AI-first companies.’

At the end of the day, the mantra still holds: Generative AI is easy to demo and get started with, but extremely hard to productionize.

And it’s showing.

CYBERSECURITY
Google’s ‘Big Sleep’ Identifies Bugs

According to a Google disclosure, one of its LLM-based agents, known as ‘Big Sleep,’ has found a severe bug in SQLite, an open-source database engine. As the researchers explained in a blog post, they believe “this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software.”

If you’re familiar with this space, you may realize this is not the first time AI has been used to uncover bugs; back in August, another team found a different SQLite vulnerability at a DARPA event, and machine learning has been used for this purpose for quite some time.

But then, why all the buzz? Well, this time, it was an LLM. Okay, so what?

With LLMs, it is never just about what they do today but about what they promise to do in the future. In other words, while we do have specialized AIs that beat LLMs at specific tasks, LLMs promise that, eventually, a single model could outcompete all of those specialized models.

Therefore, whenever you see LLM-based news, don’t ask yourself whether AI has already done that, because it probably has. Instead, acknowledge that this is news because a new skill, bug discovery, is emerging from these models. That’s what gets researchers excited.

Well, that, and the fact that their money and their companies’ money are tied to the future success of LLMs.

Other News

TREND OF THE WEEK
Oasis, Creating Worlds in Real Time

Imagine playing in an ever-changing and novel world that adapts to every decision you make. Video games that do not require code repositories or gameplay logic to work.

They simply work by predicting the next frame in the scene based on your actions, thanks to AI.

That’s what Decart, a start-up backed by Sequoia Capital, intends to do in collaboration with another of the industry’s great disruptors, Etched (which wants to kill GPUs for good). Together, they have released a free-to-play video game demo of Minecraft that, powered by their latest Oasis model, is literally a door to the future.

But today’s trend of the week offers much more:

  • You will learn how Oasis works and what makes it different from—and similar to—other popular models like OpenAI’s video generation model Sora and, interestingly, its fascinating relationship with ChatGPT. By the end, you will understand the underlying nature of frontier AI video generation better than most: if you understand today’s article, you pretty much know how most frontier video generation models work.

  • You will learn about Etched, the most visionary contender to NVIDIA’s throne.

  • And by looking into the work of two of the most promising AI start-ups, you will broaden your perspective of what’s possible with AI.

Let’s dive in!

Real-Time Video Generation

Video generation AI models are the most complex of all frontier models. They come in two forms: models that generate video frame by frame and models that generate the entire video in one go.

While today’s protagonist belongs to the former group, I think the best way to understand how it works is to draw an analogy with the latter, which is currently much more prominent, with examples we’ve covered in this newsletter many times, like OpenAI’s Sora, Pika, and Runway.

Therefore, today, you’ll learn how both work. All in one 15-minute article.

Latent Diffusion Transformer

Before discussing the core of both models, we need to define them. Both Oasis and Sora are latent diffusion models (this will make sense in a minute).

In a nutshell, whenever these models receive an input video, they represent it as a set of numerical vectors, known as embeddings in AI lingo.

But what is an embedding really?

To build the intuition, consider video games, a prevalent place where things are represented as lists of numbers. In NBA video games, players are assigned skill attributes like speed, dribbling ability, or height that define them as players, as we can see below with Curry.

Stephen Curry’s NBA 2k25 player card. Source

Crucially, only the attributes that matter are included; whether Curry is vegan isn’t computed as part of the list.

These are vector embeddings in a nutshell: representing world concepts in a way that machines can compute (in numerical form) while helping them focus on the attributes ‘that matter.’

In other words, embeddings capture the key attributes of real data.
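To make the idea concrete, here is a toy sketch of embeddings as attribute vectors. The player names are real, but every number below is made up purely for illustration:

```python
import numpy as np

# Toy 'player embeddings': each player is a list of the attributes that matter.
#                     speed, dribbling, 3pt shooting, height (normalized)
curry  = np.array([0.92, 0.95, 0.99, 0.55])
jokic  = np.array([0.55, 0.80, 0.82, 0.95])
rookie = np.array([0.70, 0.60, 0.55, 0.70])

def similarity(a, b):
    """Cosine similarity: how 'close' two embeddings are."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Players are now comparable purely through numbers
print(similarity(curry, jokic))
print(similarity(curry, rookie))
```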

To do this in video models, we use autoencoders. These AI architectures take the input and use an ‘encoder’ to transform it into its latent vector, depicted below by OpenAI as a series of 3D-shaped embeddings.

Source: OpenAI

Why are these embeddings 3D?

Simple: because they represent both the spatial features of the video (what each frame shows) and the temporal features (how elements in the video evolve from frame to frame).

Then, a ‘decoder’ is used to reconstruct the latent vector back into the original data, forcing the model to learn the patterns in the data in order to reconstruct it.

Source: OpenAI

In the image above, we can see the diffusion process in full effect, a process we describe in the next section. Also, bear in mind that the reconstructed video would be the fish from the first image; OpenAI simply used two different videos to illustrate the two-step process.

Long story short, video generation models use autoencoders to compress data into vectors and then reconstruct it. The intuition is that, much like breaking a puzzle into pieces and then reassembling it, you need knowledge about the input to reconstruct it. In other words, learning to reconstruct inputs leads to learning about the inputs.
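Here is a minimal, hypothetical sketch of that encode-then-reconstruct loop in PyTorch. Real video autoencoders are vastly larger and work on 3D (space + time) latents rather than flattened frames, but the principle is the same:

```python
# Minimal, hypothetical autoencoder sketch: compress a frame into a small
# latent vector and reconstruct it. Not Sora's or Oasis' actual architecture.
import torch
import torch.nn as nn

class TinyFrameAutoencoder(nn.Module):
    def __init__(self, frame_dim=64 * 64 * 3, latent_dim=128):
        super().__init__()
        # Encoder: frame pixels -> compact latent embedding
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Decoder: latent embedding -> reconstructed frame pixels
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, x):
        latent = self.encoder(x)
        return self.decoder(latent), latent

model = TinyFrameAutoencoder()
frames = torch.rand(8, 64 * 64 * 3)                     # a batch of 8 flattened toy frames
reconstruction, latent = model(frames)
loss = nn.functional.mse_loss(reconstruction, frames)   # reconstruction objective
loss.backward()                                         # learning to rebuild = learning the data
```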

But autoencoders need an extra piece: the core. And here, again, Sora and Oasis are identical.

DiTs, the Gold Standard for Video Generation

At the core, both use diffusion transformers (DiTs) to generate video, once again proving the Transformer architecture's pervasiveness across all facets of AI, as we discussed on Sunday.

Specifically, diffusion transformers merge two worlds:

  • Diffusion. These models learn to predict the noise in an image or video (a sequence of images) and erase it, leaving the final image or frame.

As I’ve explained in the past, diffusion models are like marble sculptors. They receive the noisy input (the coarse marble block), visualize the underlying sculpture, and simply uncover it, conditioning on the user’s input (e.g., “draw a picture of a sitting cat”).

The essence of diffusion

As one famous Renaissance artist once said:

“The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.”

Michelangelo

  • Transformers. The Transformer is the architecture that actually processes those latent embeddings (broken into patches, the video equivalent of ChatGPT’s text tokens), modeling how they relate to each other across space and time.

But wait, if Sora and Oasis are Transformers, and so is ChatGPT, what makes them different, considering their outputs differ dramatically (text vs. video)? Here we draw another distinction:

- ChatGPT is an autoregressive transformer, a model that takes an input and predicts a continuation to that input one token (word) at a time.

- Sora/Oasis are diffusion transformers that, as we’ve seen, take a noisy input and eliminate the noise to unearth the output.

Hence, while both process data almost identically, the sampling mechanism (how they generate outputs) changes.
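To make the distinction concrete, here is a hedged sketch of the two sampling loops using dummy stand-in models (nothing here is a real API, and the diffusion update rule is heavily simplified; real samplers follow a proper noise schedule):

```python
# Hypothetical sketch contrasting the two sampling mechanisms.
# `next_token_model` and `denoise_model` are stand-ins, not real APIs.
import torch

def sample_autoregressive(next_token_model, prompt_tokens, n_new=5):
    """ChatGPT-style: append one token at a time, each conditioned on the past."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = next_token_model(torch.tensor(tokens))
        tokens.append(int(torch.argmax(logits)))    # greedy pick of the next token
    return tokens

def sample_diffusion(denoise_model, shape, steps=50):
    """Sora-style: start from pure noise and iteratively remove predicted noise."""
    x = torch.randn(shape)                          # the noisy 'marble block'
    for t in reversed(range(steps)):
        predicted_noise = denoise_model(x, t)
        x = x - predicted_noise / steps             # simplified denoising update
    return x

# Dummy stand-in models so the sketch runs end to end
vocab_size = 100
dummy_lm = lambda tokens: torch.randn(vocab_size)
dummy_denoiser = lambda x, t: 0.1 * torch.randn_like(x)

print(sample_autoregressive(dummy_lm, [1, 2, 3]))
print(sample_diffusion(dummy_denoiser, shape=(3, 8, 8)).shape)
```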

At this point, you may think that OpenAI’s Sora and Oasis are identical. But then, why can’t I generate playable games with Sora? What makes Oasis unique? Here’s where we draw the key distinction:

In fact, Oasis is a mixture of ChatGPT and Sora.

Bridging Both Worlds

Simply put, Oasis is both autoregressive and diffusion-based at the same time.

In other words, it’s a diffusion model like any other video generation model, but instead of generating the entire video in one go as Sora does, it generates the video autoregressively, one frame at a time, based on the previous frames and the user’s next action (just like ChatGPT generates words one at a time based on previous words), making the video interactive (playable).

But how do we do this? Decart uses a new method called Diffusion Forcing, introduced by MIT researchers a few months ago.

The Intuition Behind Diffusion Forcing

Without going into much detail for the sake of length, the idea is that we will generate each new frame based on the past frames (just like ChatGPT predicts a new word solely based on previous words) while also making the model predict the new frame based on incomplete data.

In other words, instead of having the previous frames in pristine form to predict the new one (ChatGPT requires previous words to be decided before predicting the next one), in Oasis’ case, the previous frames may still be in the process of being fully rendered while the next one starts to be generated.

This turns the model’s predictions into a task of predicting new data based on a partially-observable past, which means that the model has to be able to predict the future based on incomplete or suboptimal input data.

But why? Aren’t we making the task harder? Well, yes, but that’s precisely what you want, because we get the best of both worlds:

  • By going autoregressive, we can predict the video frame by frame, which is ideal for letting the user interact with (decide) what comes next.

  • But by making the past noisy, we can also train the model in a ‘harder’ training regime that forces the model to learn more powerful video patterns, thereby leading to more robust predictions.

Visually, we can see below that as frame number five (right) is being denoised (generated), the previous images still have some noise in them. Therefore, the model has to use sub-optimal forms of the first four frames to predict frame number five, leading to more robust training.

This may sound weird, but look at it this way:

A model that learns to predict in sub-optimal (harder) situations learns and predicts better.
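Here is a rough, illustrative sketch of that idea, not Decart’s actual implementation: during training, each frame in a clip gets its own independent noise level, so the model must learn to predict the next frame from a past that may itself still be partially noisy. The linear noising rule below is a simplification of the real noise schedule:

```python
# Hedged sketch of the Diffusion Forcing idea (independent per-frame noise levels).
import torch

frames = torch.rand(4, 3, 8, 8)     # 4 toy frames, each (channels, height, width)
num_levels = 10

# Independent noise level per frame: some past frames may be cleaner, others noisier
noise_levels = torch.randint(0, num_levels, (frames.shape[0],))
alphas = 1.0 - noise_levels.float() / num_levels    # 1.0 = clean, ~0.0 = pure noise

noise = torch.randn_like(frames)
noisy_frames = alphas.view(-1, 1, 1, 1) * frames + \
               (1 - alphas).view(-1, 1, 1, 1) * noise

# Training target (conceptually): given noisy_frames[:-1] and the player's
# next action, predict the clean (or less noisy) version of the next frame.
print(noise_levels.tolist(), noisy_frames.shape)
```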

All this leads to a playable demo you can try for yourself here. Be aware that the game will appear buggy because it has just three seconds of memory (after three seconds, the model forgets what you did previously), but bear in mind that every new frame you see is being autoregressively generated by an AI without using code or any type of game logic.

Luckily, the promise of scaling this model to larger sizes and, as the partnership with Etched implies, running it on more powerful hardware like Sohu, which promises an enormous computing-power improvement over NVIDIA’s current state-of-the-art H100 GPUs, could lead to an entirely new family of AI products.

But what is Sohu?

The First Transformer-based hardware

Sohu is poised to be the first chip designed exclusively to run Transformer architectures. It’s not general-purpose like GPUs; it’s only meant for Transformers, making it ideal for running these architectures but useless for anything else.

And it shows. While the H100 only manages video generation at 20 frames per second, 720p resolution, and a model just over 500 million parameters in size, Sohu could theoretically allow 30 fps, 4K-resolution video games, and models of 100 billion-plus parameters.

Source: Etched

All things considered, Oasis could be regarded as the first-ever real-time, AI-generated playable game engine. It is the precursor of a future where we can create playable games from scratch with no code or logic required, a real eye-opener as to what the future holds.

Closing thoughts

Another week passes, and another AI use case blows our minds. Still, we must acknowledge that AI continues to be a demo-only technology in most cases, as true applied value is still largely missing.

And no data hits harder than seeing eager-to-adopt companies like the ones listed above spending only a fraction of what they could because of AI’s biggest problem: implementation.

But Trump's victory this week is also crucial for the industry at the political and geopolitical level; for better or worse, things will change dramatically from 2025 onward.

Whether that’s good or bad for you or your company, the winners and losers will depend on several key factors we’ll discuss on Sunday.

THEWHITEBOX
Premium

If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!